Discussion:
Erasure code ruleset for small cluster
Caspar Smit
2018-02-02 16:13:59 UTC
Hi all,

I'd like to set up a small cluster (5 nodes) using erasure coding. I would
like to use k=5 and m=3.
Normally you would need a minimum of 8 nodes (preferably 9 or more) for
this.

Then I found this blog post:
https://ceph.com/planet/erasure-code-on-small-clusters/

This sounded ideal to me, so I started building a test setup using the 5+3
profile.

Changed the erasure ruleset to:

rule erasure_ruleset {
        ruleset X
        type erasure
        min_size 8
        max_size 8
        step take default
        step choose indep 4 type host
        step choose indep 2 type osd
        step emit
}

Created a pool with this rule, and now every PG has 8 shards across 4 hosts
with 2 shards each. Perfect.
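
For completeness, the profile and pool were created roughly like this (the
profile/pool names and pg counts are just placeholders; the rule above was
added to a decompiled copy of the CRUSH map):

# ceph osd erasure-code-profile set ec53 k=5 m=3
# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
  (add the erasure_ruleset rule above to crushmap.txt)
# crushtool -c crushmap.txt -o compiled-crushmap-new
# ceph osd setcrushmap -i compiled-crushmap-new
# ceph osd pool create ecpool 64 64 erasure ec53 erasure_ruleset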

But then I tested a node failure. No problem there either: all PGs stay
active (mostly undersized+degraded, but still active). Then after 10
minutes the OSDs on the failed node were all marked as out, as expected.

I waited for the data to be recovered to the other (fifth) node, but that
doesn't happen; there is no recovery whatsoever.

Only when I completely remove the down+out OSDs from the cluster is the
data recovered.
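
(One way to see where the shards of a PG actually sit is its up/acting set,
e.g. via the commands below; the pg id is just an example:)

# ceph pg dump pgs_brief
# ceph pg map 2.0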

My guess is that the "step choose indep 4 type host" chooses 4 hosts
beforehand to store data on.

Would it be possible to do something like this:

Create a 5+3 EC profile where every host has a maximum of 2 shards (so 4
hosts are needed), and in case of a node failure, recover the data from the
failed node to the fifth node.

Thank you in advance,
Caspar
Gregory Farnum
2018-02-02 18:09:31 UTC
Post by Caspar Smit
step take default

Take the default root.

Post by Caspar Smit
step choose indep 4 type host

Choose four hosts that exist under the root. *Note that at this layer,
it has no idea what OSDs exist under the hosts.*

Post by Caspar Smit
step choose indep 2 type osd

Within each of the hosts chosen above, choose two OSDs.


Marking out an OSD does not change the CRUSH weight of its host, because
doing so would cause massive data movement across the whole cluster on a
single disk failure. The "chooseleaf" steps deal with this (because if
they fail to pick an OSD within a host, they will back out and go for a
different host), but that doesn't work when you're doing independent
"choose" steps.
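
If you want to see this without waiting on the cluster, you can re-run the
crushtool test against your compiled map with the failed host's OSDs
weighted to zero (OSD ids 5 and 6 below are just an example; check
crushtool --help for the exact flags):

# crushtool --test -i <compiled crushmap> --rule <rule id> --num-rep 8
--show-mappings --x 1 --weight 5 0 --weight 6 0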

I don't remember the implementation details well enough to be sure,
but you *might* be able to do something like

step take default
step chooseleaf indep 4 type host
step take default
step chooseleaf indep 4 type host
step emit

And that will make sure you get at least 4 OSDs involved?
-Greg
Caspar Smit
2018-02-05 11:23:31 UTC
Hi Gregory,

Thanks for your answer.

I had to add another step emit to your suggestion to make it work:

step take default
step chooseleaf indep 4 type host
step emit
step take default
step chooseleaf indep 4 type host
step emit

However, now the same OSDs are chosen twice for every PG:

# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
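
(It is not specific to x=1; sweeping a few inputs shows the same doubling,
e.g.:)

# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings
--min-x 1 --max-x 10 --num-rep 8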

I'm wondering why something like this won't work (crushtool test ends up
empty):

step take default
step chooseleaf indep 4 type host
step choose indep 2 type osd
step emit


# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 []

Kind regards,
Caspar Smit
Gregory Farnum
2018-02-05 16:57:06 UTC
Post by Caspar Smit
Hi Gregory,
Thanks for your answer.
step take default
step chooseleaf indep 4 type host
step emit
step take default
step chooseleaf indep 4 type host
step emit
# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
Oh, that must be because it has the exact same inputs on every run.
Hrmmm...Sage, is there a way to seed them differently? Or do you have any
other ideas? :/
Post by Caspar Smit
I'm wondering why something like this won't work (crushtool test ends up
empty):
step take default
step chooseleaf indep 4 type host
step choose indep 2 type osd
step emit
Chooseleaf is telling CRUSH to go all the way down to individual OSDs. I'm
not quite sure what happens when you then tell it to pick OSDs again, but
obviously it's failing (the instruction is nonsense) and emitting an empty
list.
Post by Caspar Smit
# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 []
Sage Weil
2018-02-05 17:00:50 UTC
Post by Gregory Farnum
Post by Caspar Smit
Hi Gregory,
Thanks for your answer.
step take default
step chooseleaf indep 4 type host
step emit
step take default
step chooseleaf indep 4 type host
step emit
# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
Oh, that must be because it has the exact same inputs on every run.
Hrmmm...Sage, is there a way to seed them differently? Or do you have any
other ideas? :/
Nope. The CRUSH rule isn't meant to work like that.
Post by Gregory Farnum
Post by Caspar Smit
I'm wondering why something like this won't work (crushtool test ends up
empty):
step take default
step chooseleaf indep 4 type host
Yeah, s/chooseleaf/choose/ and it should work!
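
(i.e., with that substitution the rule would read:)

step take default
step choose indep 4 type host
step choose indep 2 type osd
step emit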
s