Discussion: OSDs with primary affinity 0 still used for primary PG
Teun Docter
2018-02-12 15:56:48 UTC
Hi,

I'm looking into storing the primary copy on SSDs, and replicas on spinners.
One way to achieve this should be the primary affinity setting, as outlined in this post:

https://www.sebastien-han.fr/blog/2015/08/06/ceph-get-the-best-of-your-ssd-with-primary-affinity

So I've deployed a small test cluster and set the affinity to 0 for half the OSDs and to 1 for the rest:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.07751 root default
-3       0.01938     host osd001
 1   hdd 0.00969         osd.1       up  1.00000 1.00000
 4   hdd 0.00969         osd.4       up  1.00000       0
-7       0.01938     host osd002
 2   hdd 0.00969         osd.2       up  1.00000 1.00000
 6   hdd 0.00969         osd.6       up  1.00000       0
-9       0.01938     host osd003
 3   hdd 0.00969         osd.3       up  1.00000 1.00000
 7   hdd 0.00969         osd.7       up  1.00000       0
-5       0.01938     host osd004
 0   hdd 0.00969         osd.0       up  1.00000 1.00000
 5   hdd 0.00969         osd.5       up  1.00000       0
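
For reference, the zero affinities above were set per OSD with commands along these lines:

# ceph osd primary-affinity osd.4 0
# ceph osd primary-affinity osd.6 0
# ceph osd primary-affinity osd.7 0
# ceph osd primary-affinity osd.5 0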

Then I've created a pool. The summary at the end of "ceph pg dump" looks like this:

sum 0 0 0 0 0 0 0 0
OSD_STAT USED  AVAIL  TOTAL  HB_PEERS        PG_SUM PRIMARY_PG_SUM
7        1071M  9067M 10138M [0,1,2,3,4,5,6]    192             26
6        1072M  9066M 10138M [0,1,2,3,4,5,7]    198             18
5        1071M  9067M 10138M [0,1,2,3,4,6,7]    192             21
4        1076M  9062M 10138M [0,1,2,3,5,6,7]    202             15
3        1072M  9066M 10138M [0,1,2,4,5,6,7]    202            121
2        1072M  9066M 10138M [0,1,3,4,5,6,7]    195            114
1        1076M  9062M 10138M [0,2,3,4,5,6,7]    161             95
0        1071M  9067M 10138M [1,2,3,4,5,6,7]    194            102
sum      8587M 72524M 81111M

Now, the OSDs for which the primary affinity is set to zero are acting as primary far less often than the others.

But what I'm wondering about is this:

For those OSDs that have primary affinity set to zero, why is the PRIMARY_PG_SUM column not zero?

# ceph -v
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Note that I've created the pool after setting the primary affinity, and no data is stored yet.

Thanks,
Teun
David Turner
2018-02-12 20:17:36 UTC
If you look at the PGs that are primary on an OSD with primary affinity 0,
you'll find that those PGs map only to OSDs with primary affinity 0, so one
of them has to take the reins or nobody would be responsible for the PG.
To prevent this from happening, you would need to configure your CRUSH map
so that every PG is guaranteed to land on at least one OSD that doesn't
have a primary affinity of 0.
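
For example (just a sketch; 4, 5, 6 and 7 are the affinity-0 OSDs in your tree), you can dump the up/acting sets of every PG and inspect the ones whose acting primary is one of those OSDs:

# one line per PG: pgid, state, up set, up primary, acting set, acting primary
ceph pg dump pgs_brief
# or spot-check the mapping of a single PG
ceph pg map <pgid>

For each of those PGs you should find that every OSD in the acting set has primary affinity 0.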
Teun Docter
2018-02-15 10:52:51 UTC
Hi David,

Thanks for explaining that; it makes sense. (Though I guess the docs aren't very clear on that point.) I have a follow-up question on your suggestion to modify the CRUSH map, though.

I've seen a few examples of how to use CRUSH rules to place primary copies on SSDs and secondary copies on HDDs. In fact, one such example is in the main Ceph docs. However, they all seem to be based on the premise of having two types of OSD servers: one type with *only* SSDs, and the other with *only* HDDs.

However, that's not the scenario I'm investigating. I would like each of my OSD servers to be the same. Each would contain a number of SSDs, and a number of HDDs.

After reading up on CRUSH rules, I think I understand how to set up a basic rule that would place the primary copy on an SSD and the other copies on HDDs. But what I haven't figured out yet is whether it's possible to avoid placing one of the secondary copies on the same host that stores the primary copy.
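
For reference, the kind of rule I have in mind is roughly the ssd-primary example from the docs, adapted to the Luminous device classes (untested, just a sketch):

rule ssd-primary {
        id 5
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
}

As far as I can tell, nothing in a rule like this stops the HDD step from choosing the host that already holds the SSD primary.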

I found an earlier thread [1] where you hinted at using racks for this, but in that thread I think there is also some confusion about SSD-only/HDD-only servers versus "hybrid" servers. In addition, I found an issue in Red Hat's tracker [2] which also outlines this problem.

With my current understanding of CRUSH rules, I'm not sure the setup I had in mind is feasible. Is it?

Thanks,
Teun

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017589.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1517128
Eric Goirand
2018-02-15 12:04:50 UTC
Hello Teun,

please see below.
Post by Teun Docter
After reading up on CRUSH rules, I think I understand how to set up a basic rule that would place the primary copy on an SSD and the other copies on HDDs. But what I haven't figured out yet is whether it's possible to avoid placing one of the secondary copies on the same host that stores the primary copy.
The only way (as of now, before the work from the bugzilla link [2] is
included) to avoid having two copies on the same server (one copy on an
SSD and one copy on an HDD) is to physically separate the servers that
contain the SSDs from the servers that contain the HDDs.

You would then create your rule as you did previously, but this time the
two roots you start from (step take) are separate, so no two copies can
end up on the same server.
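
For example, something along these lines, assuming you split the CRUSH tree into two roots, say "ssds" for the SSD-only servers and "hdds" for the HDD-only servers (the names are only placeholders):

rule ssd-primary-split {
        id 6
        type replicated
        min_size 1
        max_size 10
        step take ssds
        step chooseleaf firstn 1 type host
        step emit
        step take hdds
        step chooseleaf firstn -1 type host
        step emit
}

Because the hosts under "ssds" and "hdds" are different physical servers, the primary and the replicas can never land on the same host.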

If you stick with the collocated drive setup and still keep min_size equal
to 2 on your pools, I would suggest using a replica count of 4 instead of 3,
so that you keep access to all of your data even when one server is down
for maintenance.
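
For example (the pool name is a placeholder):

# ceph osd pool set <pool> size 4
# ceph osd pool set <pool> min_size 2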
Thanks,
Eric.