Discussion:
[ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client
Vlad Kopylov
2018-11-09 03:54:18 UTC
I am trying to test replicated ceph with servers in different buildings,
and I have a read problem.
Reads from one building go to an osd in another building and vice versa,
making reads slower than writes! This makes reads as slow as the slowest node.

Is there a way to
- disable parallel read (so it reads only from the same osd node where the mon
is);
- or give each client a read restriction per osd;
- or maybe strictly specify the read osd on mount;
- or have a node read delay cap (for example, if a node's timeout is larger than
2 ms, do not use that node for reads, as other replicas are available);
- or the ability to place clients on the CRUSH map, so it understands that an
osd in, for example, the same data-center as the client has preference, and
pulls data from it/them?

Mounting with the kernel client on latest Mimic.

Thank you!

Vlad
Martin Verges
2018-11-09 07:21:07 UTC
Hello Vlad,

Ceph clients connect to the primary OSD of each PG. If you create a
crush rule for building1 and one for building2 that each take an OSD from
the same building as the first one, your reads from the pool will always
stay in the same building (if the cluster is healthy) and only write
requests get replicated to the other building.
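
For reference, the primary is the first OSD shown in a PG's acting set;
assuming a PG id such as 1.5f, it can be checked with something like:

~ $ ceph pg map 1.5f

which prints the PG's up and acting OSD sets, e.g. "up [0,2,1] acting
[0,2,1]", where osd.0 would be the primary.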

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: ***@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx
Vlad Kopylov
2018-11-09 16:28:01 UTC
Martin, thank you for the tip.
Googling ceph crush rule examples doesn't give much on rules, just static
placement of buckets.
This all seems to be about placing data, not about giving a client in a
specific datacenter the proper read osd.

Maybe something is wrong with the placement groups?

I added datacenters dc1, dc2 and dc3.
The current replicated_rule is:

rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# buckets
host ceph1 {
id -3 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.000
}
datacenter dc1 {
id -9 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0 # rjenkins1
item ceph1 weight 1.000
}
host ceph2 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0 # rjenkins1
item osd.1 weight 1.000
}
datacenter dc2 {
id -10 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0 # rjenkins1
item ceph2 weight 1.000
}
host ceph3 {
id -7 # do not change unnecessarily
id -12 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0 # rjenkins1
item osd.2 weight 1.000
}
datacenter dc3 {
id -11 # do not change unnecessarily
id -13 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0 # rjenkins1
item ceph3 weight 1.000
}
root default {
id -1 # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 3.000
alg straw2
hash 0 # rjenkins1
item dc1 weight 1.000
item dc2 weight 1.000
item dc3 weight 1.000
}


#ceph pg dump
dumped all
version 29433
stamp 2018-11-09 11:23:44.510872
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.5f 0 0 0 0 0 0 0 0 active+clean 2018-11-09 04:35:32.320607 0'0 544:1317 [0,2,1] 0 [0,2,1] 0 0'0 2018-11-09 04:35:32.320561 0'0 2018-11-04 11:55:54.756115 0
2.5c 143 0 143 0 0 19490267 461 461 active+undersized+degraded 2018-11-08 19:02:03.873218 508'461 544:2100 [2,1] 2 [2,1] 2 290'380 2018-11-07 18:58:43.043719 64'120 2018-11-05 14:21:49.256324 0
.....
sum 15239 0 2053 2659 0 2157615019 58286 58286
OSD_STAT USED AVAIL TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2 3.7 GiB 28 GiB 32 GiB [0,1] 200 73
1 3.7 GiB 28 GiB 32 GiB [0,2] 200 58
0 3.7 GiB 28 GiB 32 GiB [1,2] 173 69
sum 11 GiB 85 GiB 96 GiB

#ceph pg map 2.5c
osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]

#pg map 1.5f
osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
Vlad Kopylov
2018-11-09 16:35:52 UTC
Please disregard the pg status; one of the test VMs was down for some time and
it is healing.
The question is only how to make it read from the proper datacenter.

If you have an example.

Thanks
Martin Verges
2018-11-09 19:09:56 UTC
Hello Vlad,

you can generate something like this:

rule dc1_primary_dc2_secondary {
id 1
type replicated
min_size 1
max_size 10
step take dc1
step chooseleaf firstn 1 type host
step emit
step take dc2
step chooseleaf firstn 1 type host
step emit
step take dc3
step chooseleaf firstn -2 type host
step emit
}

rule dc2_primary_dc1_secondary {
id 2
type replicated
min_size 1
max_size 10
step take dc2
step chooseleaf firstn 1 type host
step emit
step take dc1
step chooseleaf firstn 1 type host
step emit
step take dc3
step chooseleaf firstn -2 type host
step emit
}

After you added such crush rules, you can configure the pools:

~ $ ceph osd pool set <pool_for_dc1> crush_ruleset 1
~ $ ceph osd pool set <pool_for_dc2> crush_ruleset 2
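
Note that on Luminous and newer releases (including Mimic) the pool option is
named crush_rule and takes the rule name rather than a numeric ruleset id, so
the equivalent there would presumably be:

~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary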

Now you place your workload from dc1 in the dc1 pool, and workload
from dc2 in the dc2 pool. You could also use HDDs with an SSD journal (if
your workload isn't that write intensive) and save some money in dc3,
as your client would always read from an SSD and write to the hybrid setup.

Btw. all this could be done with a few simple clicks through our web
frontend. Even if you want to export it via CephFS / NFS / ... it is
possible to set it on a per-folder level. Feel free to take a look at
http://youtu.be/V33f7ipw9d4 to see how easy it could be.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: ***@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx
Vlad Kopylov
2018-11-11 18:47:11 UTC
Maybe it is possible if done via a gateway NFS export?
Do the gateway settings allow read osd selection?

v
Post by Martin Verges
Hello Vlad,
If you want to read from the same data, then it is not possible (as far as I
know).
Post by Vlad Kopylov
Maybe I missed something, but the FS explicitly selects the pools to put
files and metadata in, like I did below.
So if I create new pools, the data in them will be different. If I apply the
rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs t01,
it will start using the dc1 hosts.
ceph osd pool create cfs_data 100
ceph osd pool create cfs_meta 100
ceph fs new t01 cfs_data cfs_meta
sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
name=admin,secretfile=/home/mciadmin/admin.secret
rule dc1_primary {
id 1
type replicated
min_size 1
max_size 10
step take dc1
step chooseleaf firstn 1 type host
step emit
step take dc2
step chooseleaf firstn -2 type host
step emit
step take dc3
step chooseleaf firstn -2 type host
step emit
}
Just to confirm: it will still populate 3 copies, one in each datacenter?
I thought this map was to select where to write to; I guess it does the write
replication on the back end.
I thought pools were completely separate and clients would not see each
other's data?
Thank you Martin!
Vlad Kopylov
2018-11-13 20:19:53 UTC
Or is it possible to mount one OSD directly for read file access?

v
Jean-Charles Lopez
2018-11-13 21:25:53 UTC
Hi Vlad,

No need for a specific CRUSH map configuration. I’d suggest you use the primary-affinity setting on the OSDs so that only the OSDs that are close to your read point are selected as primary.

See https://ceph.com/geen-categorie/ceph-primary-affinity/ for information

Just set the primary affinity of all the OSDs in building 2 to 0.
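
Assuming, for example, that osd.1 and osd.2 are the OSDs outside of building 1,
that would be something like:

~ $ ceph osd primary-affinity osd.1 0
~ $ ceph osd primary-affinity osd.2 0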

Only the OSDs in building 1 should then be used as primary OSDs.

BR
JC
Vlad Kopylov
2018-11-14 00:53:52 UTC
Each of the 3 clients from different buildings picks the same
primary-affinity OSD, and everything is slow on at least two of them.
Instead of just reading from their local OSD, they read mostly from
the primary-affinity OSD.

*What I need is something like primary-affinity for each client connection*

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.08189 root default
-3 0.02730 host vm1
0 hdd 0.02730 osd.0 up 1.00000 1.00000
-10 0.02730 host vm2
1 hdd 0.02730 osd.1 up 1.00000 0.50000
-5 0.02730 host vm3
2 hdd 0.02730 osd.2 up 1.00000 0.50000

v
Konstantin Shalygin
2018-11-14 05:11:02 UTC
Post by Vlad Kopylov
Or is it possible to mount one OSD directly for read file access?
In Ceph it is impossible to do I/O directly to an OSD, only to a PG.
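
For illustration, assuming a pool named cfs_data (the data pool from earlier
in this thread) and a hypothetical object name, the object-to-PG-to-OSD
mapping can be inspected with:

~ $ ceph osd map cfs_data some_object_name

which shows the PG the object hashes to and the up/acting OSD set (including
the primary) that serves it.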



k
Vlad Kopylov
2018-11-15 02:31:40 UTC
Thanks Konstantin. I already tried accessing it in different ways, and the best
I got was bulk-renamed files and other non-presentable data.

Maybe to solve this I can create overlapping osd pools?
Like one pool that includes all 3 osds for replication, and 3 more that each
include one osd at a single site, with the same blocks?

v
Konstantin Shalygin
2018-11-15 02:52:04 UTC
Post by Vlad Kopylov
Thanks Konstantin. I already tried accessing it in different ways, and the
best I got was bulk-renamed files and other non-presentable data.
Maybe to solve this I can create overlapping osd pools?
Like one pool that includes all 3 osds for replication, and 3 more that each
include one osd at a single site, with the same blocks?
As far as I understand, you need something like this:


vm1 io -> building1 osds only

vm2 io -> building2 osds only

vm3 io -> building3 osds only


Right?



k
Vlad Kopylov
2018-11-16 04:57:22 UTC
Exactly. But write operations should go to all nodes.

v
Konstantin Shalygin
2018-11-16 08:43:52 UTC
Post by Vlad Kopylov
Exactly. But write operations should go to all nodes.
This can be set via primary affinity [1]: when a ceph client reads or
writes data, it always contacts the primary OSD in the acting set.


If you want to totally segregate IO, you can use device classes:

Just create osds with different classes:

dc1

  host1

    red osd.0 primary

    blue osd.1

    green osd.2

dc2

  host2

    red osd.3

    blue osd.4 primary

    green osd.5

dc3

  host3

    red osd.6

    blue osd.7

    green osd.8 primary
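
A sketch of how the custom classes above might be assigned, assuming the osd
ids from that layout (any automatically assigned class has to be removed
before setting a new one):

ceph osd crush rm-device-class osd.0

ceph osd crush set-device-class red osd.0

and so on for the remaining osds.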


create 3 crush rules:

ceph osd crush rule create-replicated red default host red

ceph osd crush rule create-replicated blue default host blue

ceph osd crush rule create-replicated green default host green


and 3 pools:

ceph osd pool create red 64 64 replicated red

ceph osd pool create blue 64 64 replicated blue

ceph osd pool create green 64 64 replicated green
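
To verify which rule each pool ended up with, something like this should work:

ceph osd pool get red crush_rule

ceph osd pool get blue crush_rule

ceph osd pool get green crush_rule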


[1]
http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity



k
Vlad Kopylov
2018-11-16 18:07:32 UTC
Permalink
This is what Jean suggested. I understand it, and it works via the primary.
*But what I need is for all clients to access the same files, not separate
sets (like red, blue, green).*

Thanks Konstantin.
Konstantin Shalygin
2018-11-19 07:54:12 UTC
Permalink
You should look at other solutions, like GlusterFS. Ceph is too much
overhead for this case, IMHO.



k
Vlad Kopylov
2018-11-20 02:31:28 UTC
Permalink
Yes, I am using GlusterFS now.
But Ceph has the best write replication, which I am struggling to get the
Gluster guys to implement.

If this read-replica selection issue could be fixed, Ceph could be a good
cloud fs, not just a local-network RAID.
Patrick Donnelly
2018-11-21 01:29:46 UTC
Permalink
You either need to accept that reads/writes will land on different data
centers, ensure the primary OSD for a given pool is always in the desired
data center, or use some other non-Ceph solution, which will have either
expensive, eventual, or false consistency.
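
For the second option, a minimal sketch of a CRUSH rule that keeps one
replica per datacenter and always puts the primary in dc1 (assuming the
dc1/dc2/dc3 buckets discussed above; the rule name and id are arbitrary, and
this is untested):

rule dc1_primary {
    id 11
    type replicated
    min_size 1
    max_size 3
    step take dc1
    step chooseleaf firstn 1 type host
    step emit
    step take dc2
    step chooseleaf firstn 1 type host
    step emit
    step take dc3
    step chooseleaf firstn 1 type host
    step emit
}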
Vlad Kopylov
2018-11-21 02:49:56 UTC
Permalink
I see the point, but not for the read case: there is no overhead in just
choosing, or letting a mount option choose, the read replica.

This is a simple feature that could be implemented, and it would save many
people bandwidth in truly distributed setups.

The main issue this surfaces is that RADOS maps ignore clients; they just
see the cluster. There should be a part of the RADOS map that is unique, or
can be unique, for each client connection.

Let's file a feature request?

P.S. Honestly, I don't see why anyone would use Ceph for local-network RAID
setups; there are other, simpler solutions out there, even in your own
Red Hat shop.
Gregory Farnum
2018-11-26 13:47:43 UTC
Permalink
Post by Vlad Kopylov
there is no overhead in just choosing, or letting a mount option choose,
the read replica. This is a simple feature that could be implemented, and
it would save many people bandwidth in truly distributed setups.
This is actually much more complicated than it sounds. Allowing reads from
the replica OSDs while still routing writes through a different primary OSD
introduces a great many consistency issues. We've tried adding very limited
support for this read-from-replica scenario in special cases, but have had
to roll them all back due to edge cases where they don't work.

I understand why you want it, but it's definitely not a simple feature. :(
-Greg
Vlad Kopylov
2018-11-26 16:08:38 UTC
Permalink
I see. Thank you Greg.

Ultimately this leads to some kind of multi-primary OSD/MON setup, which
will most likely add lookup overhead, though that might be a reasonable
trade-off for network-distributed setups. It would be a good feature for a
major version.

With GlusterFS I solved it, funny as it sounds, by writing a tiny FUSE
overlay fs that serves all reads locally and sends writes to the cluster.
That works because with GlusterFS there are real files on each node for
local reads.

I wish there were a way to access files on the local OSD so I could use the
same approach.

-vlad