Discussion:
[ceph-users] will crush rule be used during object relocation in OSD failure ?
ST Wong (ITSC)
2018-11-23 16:00:52 UTC
Hi all,


We have 8 OSD hosts, 4 in room 1 and 4 in room 2.

A pool with size = 3 is created using the following CRUSH rule, to cater for room failure.


rule multiroom {
        id 0
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}



We're expecting:

1. For each object, there are always 2 replicas in one room and 1 replica in the other room, making size=3. But we can't control which room has 1 or 2 replicas.

2. In case an OSD host fails, Ceph will assign remaining OSDs to the same PG to hold the replicas that were on the failed host. Selection is based on the pool's CRUSH rule, thus maintaining the same failure domain - it won't put all replicas in the same room.

3. In case the entire room holding 1 replica fails, the pool will remain degraded but won't do any replica relocation.

4. In case the entire room holding 2 replicas fails, Ceph will make use of OSDs in the surviving room to restore 2 replicas. The pool will not be writeable until all objects have 2 copies again (unless we make pool size=4?). Then when recovery is complete, the pool will remain in a degraded state until the failed room recovers.

Is our understanding correct? Thanks a lot.
Will do some simulation later to verify.
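The plan is roughly to run the compiled CRUSH map through crushtool offline, something like the following (rule id 0 as above, file names are just placeholders):

        ceph osd getcrushmap -o crushmap.bin
        crushtool -d crushmap.bin -o crushmap.txt    # decompile to double-check the rule id
        crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
                --min-x 0 --max-x 99 --show-mappings # print the OSDs chosen for 100 sample inputs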

Regards,
/stwong
Maged Mokhtar
2018-11-24 12:44:58 UTC
Post by ST Wong (ITSC)
Hi all,
We have 8 OSD hosts, 4 in room 1 and 4 in room 2.
A pool with size = 3 is created using the following CRUSH rule, to cater for room failure.
rule multiroom {
        id 0
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}
1. For each object, there are always 2 replicas in one room and 1
replica in the other room, making size=3. But we can't control which
room has 1 or 2 replicas.
2. In case an OSD host fails, Ceph will assign remaining OSDs to the
same PG to hold the replicas that were on the failed host. Selection is
based on the pool's CRUSH rule, thus maintaining the same failure
domain - it won't put all replicas in the same room.
3. In case the entire room holding 1 replica fails, the pool will
remain degraded but won't do any replica relocation.
4. In case the entire room holding 2 replicas fails, Ceph will make use
of OSDs in the surviving room to restore 2 replicas. The pool will not
be writeable until all objects have 2 copies again (unless we make pool
size=4?). Then when recovery is complete, the pool will remain in a
degraded state until the failed room recovers.
Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.
Regards,
/stwong
I think this is correct. To re-phrase 2): all PGs on the failed node
will be re-distributed across several other hosts within the same room.

Since some PGs will have 2 replicas in one room whereas other PGs
will have 2 replicas in the other room, I tend to dislike such a setup:
some PGs will suffer more than others in case of a room failure, because
your failure domain is not symmetric. More importantly, as you stated
in 4, your cluster will be down while these unfortunate PGs recover
(statistically that is half your data). In such a case I would prefer a
size=4, min_size=2 setup.
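The rule you posted already picks 2 hosts in each of 2 rooms, so it should hand back 4 OSDs as-is; only the pool settings would need to change, something like this (assuming the pool is called mypool, the name is just a placeholder):

        ceph osd pool set mypool size 4
        ceph osd pool set mypool min_size 2

That way each room always holds exactly 2 copies, and losing a whole room still leaves min_size satisfied.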

/Maged
Gregory Farnum
2018-11-26 13:27:17 UTC
Post by ST Wong (ITSC)
Hi all,
We have 8 OSD hosts, 4 in room 1 and 4 in room 2.
A pool with size = 3 is created using the following CRUSH rule, to cater for room failure.
rule multiroom {
        id 0
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}
1. For each object, there are always 2 replicas in one room and 1 replica
in the other room, making size=3. But we can't control which room has 1 or 2
replicas.
Right.
Post by ST Wong (ITSC)
2. In case an OSD host fails, Ceph will assign remaining OSDs to the same
PG to hold the replicas that were on the failed host. Selection is based on
the pool's CRUSH rule, thus maintaining the same failure domain - it won't
put all replicas in the same room.
Yes, if a host fails the copies it held will be replaced by new copies in
the same room.
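On a test cluster you can watch that happen by checking where a PG sits before and after the host goes down, e.g. (pg id 1.0 is just an example):

        ceph pg map 1.0   # shows the up/acting OSD sets for that PG
        ceph osd tree     # shows which host/room each of those OSDs belongs to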
Post by ST Wong (ITSC)
3. In case the entire room holding 1 replica fails, the pool will remain
degraded but won't do any replica relocation.
Right.
Post by ST Wong (ITSC)
4. In case the entire room holding 2 replicas fails, Ceph will make use of
OSDs in the surviving room to restore 2 replicas. The pool will not be
writeable until all objects have 2 copies again (unless we make pool
size=4?). Then when recovery is complete, the pool will remain in a degraded
state until the failed room recovers.
Hmm, I'm actually not sure if this will work out — because CRUSH is
hierarchical, it will keep trying to select hosts from the dead room and
will fill out the location vector's first two spots with -1. It could be
that Ceph will skip all those "nonexistent" entries and just pick the two
copies from slots 3 and 4, but it might not. You should test this carefully
and report back!
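crushtool can approximate that offline by weighting out every OSD in the failed room, e.g. if that room holds osd.0 through osd.3 (IDs and file name are placeholders):

        crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
                --weight 0 0 --weight 1 0 --weight 2 0 --weight 3 0 \
                --show-mappings --show-bad-mappings

That only shows the raw CRUSH output though, not how PG peering handles the empty slots, so a test on a real cluster is still the thing to trust.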
-Greg
Post by ST Wong (ITSC)
Is our understanding correct? Thanks a lot.
Will do some simulation later to verify.
Regards,
/stwong