Discussion: High apply latency
Jakub Jaszewski
2018-01-30 12:01:15 UTC
Hi,

We observe high apply_latency(ms) and, I believe, poor write performance.
The logs contain repetitive slow request warnings related to different OSDs
and servers.

ceph version 12.2.2

Cluster HW description:
9x Dell PowerEdge R730xd

1x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (10C/20T)
256 GB 2133 MHz DDR4
PERC H730 Mini 1GB cache
12x 4TB TOSHIBA MG04ACA400N (each disk configured as RAID 0 for single OSD)
2x 480GB SSDSC2BB480G7R (in RAID 1 for operating system)
1x NVMe PCIe SSD - Intel SSD DC D3700 Series for journaling, shared by all
12 OSDs
1x QLogic 57800 2x 10Gb DAC SFP+ and 2x 1Gb (configured as 2x10Gb in
802.3ad Dynamic link aggregation)


/etc/ceph/ceph.conf
[global]
fsid = 1023c49f-3b10-42de-9f62-9b122db32a9a
mon_initial_members = host01,host02,host03
mon_host = 10.212.32.18,10.212.32.19,10.212.32.20
auth_supported = cephx
public_network = 10.212.32.0/24
cluster_network = 10.212.14.0/24
rgw thread pool size = 256
[client.rgw.host01]
rgw host = host01
rgw enable usage log = true
[client.rgw.host02]
rgw host = host02
rgw enable usage log = true
[client.rgw.host03]
rgw host = host03
rgw enable usage log = true
[osd]
filestore xattr use omap = true
osd journal size = 10240
osd mount options xfs = noatime,inode64,logbsize=256k,logbufs=8
osd crush location hook = /usr/bin/ceph-crush-location.sh
osd pool default size = 3
[mon]
mon compact on start = true
mon compact on trim = true


OSD topology is shown below. I also wonder why only some OSDs have the
"hdd" device class assigned (a sketch for assigning it manually follows
right after the tree).

ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME

-1 392.73273 - 385T 236T 149T 61.24 1.00 - root
default
-6 392.73273 - 385T 236T 149T 61.24 1.00 - region
region01
-5 392.73273 - 385T 236T 149T 61.24 1.00 -
datacenter dc01
-4 392.73273 - 385T 236T 149T 61.24 1.00 -
room room01
-8 43.63699 - 44684G 26573G 18111G 59.47 0.97 -
rack rack01
-7 43.63699 - 44684G 26573G 18111G 59.47 0.97 -
host host01
0 3.63599 1.00000 3723G 2405G 1318G 64.59 1.05 210
osd.0
2 3.63599 1.00000 3723G 2004G 1719G 53.82 0.88 190
osd.2
4 3.63599 1.00000 3723G 2460G 1262G 66.08 1.08 214
osd.4
6 3.63599 1.00000 3723G 2474G 1249G 66.45 1.09 203
osd.6
8 3.63599 1.00000 3723G 2308G 1415G 61.99 1.01 220
osd.8
11 3.63599 1.00000 3723G 2356G 1367G 63.28 1.03 214
osd.11
12 3.63599 1.00000 3723G 2303G 1420G 61.86 1.01 206
osd.12
14 3.63599 1.00000 3723G 1920G 1803G 51.57 0.84 178
osd.14
16 3.63599 1.00000 3723G 2236G 1486G 60.07 0.98 203
osd.16
18 3.63599 1.00000 3723G 2203G 1520G 59.17 0.97 193
osd.18
20 3.63599 1.00000 3723G 1904G 1819G 51.15 0.84 179
osd.20
22 3.63599 1.00000 3723G 1995G 1728G 53.58 0.88 192
osd.22
-3 43.63699 - 44684G 27090G 17593G 60.63 0.99 -
rack rack02
-2 43.63699 - 44684G 27090G 17593G 60.63 0.99 -
host host02
1 hdd 3.63599 1.00000 3723G 2447G 1275G 65.74 1.07 213
osd.1
3 hdd 3.63599 1.00000 3723G 2696G 1027G 72.41 1.18 210
osd.3
5 hdd 3.63599 1.00000 3723G 2290G 1433G 61.51 1.00 188
osd.5
7 hdd 3.63599 1.00000 3723G 2171G 1551G 58.33 0.95 194
osd.7
9 hdd 3.63599 1.00000 3723G 2129G 1594G 57.18 0.93 204
osd.9
10 hdd 3.63599 1.00000 3723G 2153G 1570G 57.82 0.94 184
osd.10
13 hdd 3.63599 1.00000 3723G 2142G 1580G 57.55 0.94 188
osd.13
15 hdd 3.63599 1.00000 3723G 2147G 1576G 57.66 0.94 192
osd.15
17 hdd 3.63599 1.00000 3723G 2093G 1630G 56.21 0.92 201
osd.17
19 hdd 3.63599 1.00000 3723G 2079G 1643G 55.86 0.91 192
osd.19
21 hdd 3.63599 1.00000 3723G 2266G 1457G 60.87 0.99 201
osd.21
23 hdd 3.63599 1.00000 3723G 2472G 1251G 66.39 1.08 197
osd.23
-10 43.63699 - 44684G 27247G 17436G 60.98 1.00 -
rack rack03
-9 43.63699 - 44684G 27247G 17436G 60.98 1.00 -
host host03
24 3.63599 1.00000 3723G 2289G 1433G 61.49 1.00 195
osd.24
25 3.63599 1.00000 3723G 2584G 1138G 69.42 1.13 217
osd.25
26 3.63599 1.00000 3723G 2183G 1539G 58.65 0.96 198
osd.26
28 3.63599 1.00000 3723G 2540G 1182G 68.23 1.11 215
osd.28
30 3.63599 1.00000 3723G 2134G 1589G 57.31 0.94 207
osd.30
32 3.63599 1.00000 3723G 1715G 2008G 46.06 0.75 183
osd.32
34 3.63599 1.00000 3723G 2576G 1146G 69.20 1.13 219
osd.34
36 3.63599 1.00000 3723G 2166G 1557G 58.18 0.95 208
osd.36
38 3.63599 1.00000 3723G 2050G 1673G 55.06 0.90 191
osd.38
40 3.63599 1.00000 3723G 2542G 1181G 68.28 1.11 223
osd.40
42 3.63599 1.00000 3723G 2063G 1660G 55.42 0.90 181
osd.42
44 3.63599 1.00000 3723G 2399G 1324G 64.44 1.05 197
osd.44
-12 43.63699 - 44684G 27152G 17531G 60.76 0.99 -
rack rack04
-11 43.63699 - 44684G 27152G 17531G 60.76 0.99 -
host host04
27 3.63599 1.00000 3723G 2479G 1243G 66.60 1.09 212
osd.27
29 3.63599 1.00000 3723G 2646G 1077G 71.07 1.16 214
osd.29
31 3.63599 1.00000 3723G 2217G 1506G 59.55 0.97 202
osd.31
33 3.63599 1.00000 3723G 2073G 1650G 55.68 0.91 198
osd.33
35 3.63599 1.00000 3723G 2661G 1061G 71.48 1.17 205
osd.35
37 3.63599 1.00000 3723G 2097G 1626G 56.32 0.92 181
osd.37
39 3.63599 1.00000 3723G 1949G 1774G 52.35 0.85 196
osd.39
41 3.63599 1.00000 3723G 2205G 1518G 59.23 0.97 188
osd.41
43 3.63599 1.00000 3723G 2417G 1306G 64.92 1.06 214
osd.43
45 3.63599 1.00000 3723G 1925G 1798G 51.71 0.84 189
osd.45
46 3.63599 1.00000 3723G 2235G 1488G 60.03 0.98 208
osd.46
47 3.63599 1.00000 3723G 2243G 1480G 60.24 0.98 195
osd.47
-14 43.63699 - 37236G 25721G 11515G 69.08 1.13 -
rack rack05
-13 43.63699 - 37236G 25721G 11515G 69.08 1.13 -
host host05
48 3.63599 1.00000 3723G 2160G 1563G 58.01 0.95 213
osd.48
49 3.63599 1.00000 3723G 2603G 1120G 69.91 1.14 216
osd.49
50 3.63599 1.00000 3723G 2497G 1225G 67.08 1.10 220
osd.50
51 3.63599 1.00000 3723G 2613G 1110G 70.17 1.15 223
osd.51
52 3.63599 0 0 0 0 0 0 0
osd.52
53 3.63599 1.00000 3723G 2869G 854G 77.05 1.26 235
osd.53
54 3.63599 1.00000 3723G 3021G 702G 81.14 1.32 245
osd.54
55 3.63599 1.00000 3723G 2190G 1533G 58.83 0.96 209
osd.55
56 3.63599 0 0 0 0 0 0 0
osd.56
57 3.63599 1.00000 3723G 2651G 1072G 71.21 1.16 223
osd.57
58 3.63599 1.00000 3723G 2413G 1310G 64.81 1.06 198
osd.58
59 3.63599 1.00000 3723G 2701G 1021G 72.56 1.18 226
osd.59
-16 43.63699 - 44684G 28445G 16238G 63.66 1.04 -
rack rack06
-15 43.63699 - 44684G 28445G 16238G 63.66 1.04 -
host host06
60 hdd 3.63599 1.00000 3723G 1994G 1728G 53.57 0.87 175
osd.60
61 hdd 3.63599 1.00000 3723G 2531G 1192G 67.97 1.11 216
osd.61
62 hdd 3.63599 1.00000 3723G 2472G 1251G 66.40 1.08 217
osd.62
63 hdd 3.63599 1.00000 3723G 2588G 1135G 69.50 1.13 195
osd.63
64 hdd 3.63599 1.00000 3723G 2804G 919G 75.30 1.23 216
osd.64
65 hdd 3.63599 1.00000 3723G 2534G 1188G 68.08 1.11 208
osd.65
66 hdd 3.63599 1.00000 3723G 2362G 1361G 63.44 1.04 204
osd.66
67 hdd 3.63599 1.00000 3723G 2497G 1225G 67.08 1.10 214
osd.67
68 hdd 3.63599 1.00000 3723G 1291G 2432G 34.68 0.57 157
osd.68
69 hdd 3.63599 1.00000 3723G 2325G 1397G 62.46 1.02 192
osd.69
70 hdd 3.63599 1.00000 3723G 2590G 1133G 69.57 1.14 210
osd.70
71 hdd 3.63599 1.00000 3723G 2451G 1272G 65.83 1.08 207
osd.71
-18 43.63699 - 44684G 28525G 16158G 63.84 1.04 -
rack rack07
-17 43.63699 - 44684G 28525G 16158G 63.84 1.04 -
host host07
72 3.63599 1.00000 3723G 2415G 1307G 64.88 1.06 197
osd.72
73 3.63599 1.00000 3723G 2216G 1507G 59.52 0.97 188
osd.73
74 3.63599 1.00000 3723G 2421G 1301G 65.04 1.06 220
osd.74
75 3.63599 1.00000 3723G 2642G 1080G 70.97 1.16 207
osd.75
76 3.63599 1.00000 3723G 2307G 1416G 61.97 1.01 204
osd.76
77 3.63599 1.00000 3723G 2341G 1381G 62.89 1.03 200
osd.77
78 3.63599 1.00000 3723G 2312G 1410G 62.12 1.01 188
osd.78
79 3.63599 1.00000 3723G 2166G 1557G 58.18 0.95 183
osd.79
80 3.63599 1.00000 3723G 2458G 1264G 66.03 1.08 221
osd.80
81 3.63599 1.00000 3723G 2126G 1596G 57.12 0.93 182
osd.81
82 3.63599 1.00000 3723G 2442G 1281G 65.58 1.07 189
osd.82
83 3.63599 1.00000 3723G 2672G 1051G 71.76 1.17 214
osd.83
-20 87.27377 - 89368G 50955G 38412G 57.02 0.93 -
rack rack08
-19 43.63699 - 44684G 25780G 18903G 57.70 0.94 -
host host08
84 3.63599 1.00000 3723G 2084G 1638G 55.99 0.91 179
osd.84
85 3.63599 1.00000 3723G 2121G 1602G 56.96 0.93 193
osd.85
86 3.63599 1.00000 3723G 2131G 1591G 57.25 0.93 186
osd.86
87 3.63599 1.00000 3723G 2343G 1380G 62.94 1.03 200
osd.87
88 3.63599 1.00000 3723G 2120G 1603G 56.94 0.93 166
osd.88
89 3.63599 1.00000 3723G 2175G 1548G 58.43 0.95 161
osd.89
90 3.63599 1.00000 3723G 1974G 1749G 53.02 0.87 184
osd.90
91 3.63599 1.00000 3723G 1942G 1781G 52.17 0.85 171
osd.91
92 3.63599 1.00000 3723G 2281G 1442G 61.27 1.00 169
osd.92
93 3.63599 1.00000 3723G 2103G 1620G 56.48 0.92 190
osd.93
94 3.63599 1.00000 3723G 2326G 1397G 62.48 1.02 206
osd.94
95 3.63599 1.00000 3723G 2175G 1548G 58.42 0.95 186
osd.95
-21 43.63678 - 44684G 25175G 19508G 56.34 0.92 -
host host09
96 hdd 3.63640 1.00000 3723G 2434G 1289G 65.38 1.07 191
osd.96
97 hdd 3.63640 1.00000 3723G 2175G 1547G 58.44 0.95 183
osd.97
98 hdd 3.63640 1.00000 3723G 1771G 1951G 47.59 0.78 171
osd.98
99 hdd 3.63640 1.00000 3723G 2433G 1290G 65.34 1.07 201
osd.99
100 hdd 3.63640 1.00000 3723G 2098G 1625G 56.35 0.92 175
osd.100
101 hdd 3.63640 1.00000 3723G 2175G 1547G 58.43 0.95 188
osd.101
102 hdd 3.63640 1.00000 3723G 1741G 1982G 46.77 0.76 172
osd.102
103 hdd 3.63640 1.00000 3723G 2123G 1600G 57.02 0.93 173
osd.103
104 hdd 3.63640 1.00000 3723G 1967G 1755G 52.85 0.86 177
osd.104
105 hdd 3.63640 1.00000 3723G 2283G 1440G 61.32 1.00 162
osd.105
106 hdd 3.63640 1.00000 3723G 2021G 1702G 54.29 0.89 184
osd.106
107 hdd 3.63640 1.00000 3723G 1947G 1776G 52.30 0.85 194
osd.107
TOTAL 385T 236T 149T 61.24

MIN/MAX VAR: 0.57/1.32 STDDEV: 7.11
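
If it is simply that the class was never set on those OSDs, I assume it can
be assigned manually, something like the following per OSD id (not tried on
this cluster yet):

# ceph osd crush set-device-class hdd osd.0

and if a wrong class were already present it would have to be removed first:

# ceph osd crush rm-device-class osd.0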


Output of "ceph osd perf" when "ceph status" shows traffic like "client:
679 MB/s rd, 186 MB/s wr, 1200 op/s rd, 790 op/s wr". Most of this traffic
goes via RGW.

osd commit_latency(ms) apply_latency(ms)
107 0 22
106 0 1
105 0 0
104 0 22
103 0 13
102 0 8
101 0 2
100 0 28
99 0 25
98 0 542
97 0 418
45 0 23
44 0 26
43 0 1
42 0 9
41 0 27
40 0 0
39 0 1189
38 0 9
37 0 4
36 0 625
35 0 43
34 0 4
33 0 143
32 0 8
31 0 1
30 0 21
29 0 33
28 0 11
27 0 40
26 0 7
25 0 6
24 0 20
23 0 5
22 0 6
9 0 2
8 0 71
7 0 2
6 0 35
5 0 549
4 0 6
0 0 861
1 0 15
2 0 19
3 0 2
10 0 4
11 0 378
12 0 19
13 0 1
14 0 6
15 0 4
16 0 1040
17 0 7
18 0 8
19 0 185
20 0 32
21 0 855
46 0 32
47 0 4
48 0 7
49 0 4
50 0 7
51 0 18
53 0 3
54 0 87
55 0 4
56 0 0
57 0 1
58 0 49
59 0 6
60 0 5
61 0 19
62 0 67
63 0 0
64 0 423
65 0 26
66 0 30
67 0 74
68 0 6
69 0 4
70 0 48
71 0 1900
72 0 24
73 0 7
74 0 379
75 0 4
76 0 9
77 0 19
78 0 14
79 0 135
80 0 1736
81 0 6
82 0 2504
83 0 24
84 0 8
85 0 10
86 0 13
87 0 142
88 0 68
89 0 28
90 0 3
91 0 14
92 0 14
93 0 12
94 0 47
95 0 39
96 0 8


All OSDs report slow requests

# zgrep 'slow request' /var/log/ceph/ceph.log* | egrep -io ' osd\.[0-9]*'
|sort |uniq -c
1538 osd.0
1292 osd.1
752 osd.10
640 osd.100
10456 osd.101
2058 osd.102
2174 osd.103
1898 osd.104
1556 osd.105
2332 osd.106
786 osd.107
2408 osd.11
5222 osd.12
224 osd.13
2858 osd.14
238 osd.15
1880 osd.16
566 osd.17
120 osd.18
284 osd.19
896 osd.2
1840 osd.20
1466 osd.21
1218 osd.22
1066 osd.23
2470 osd.24
1640 osd.25
3584 osd.26
10912 osd.27
2954 osd.28
1736 osd.29
570 osd.3
1266 osd.30
298 osd.31
30 osd.32
4528 osd.33
4260 osd.34
4542 osd.35
374 osd.36
2820 osd.37
158 osd.38
748 osd.39
26816 osd.4
3378 osd.40
628 osd.41
16358 osd.42
454 osd.43
2798 osd.44
11378 osd.45
2280 osd.46
1374 osd.47
1660 osd.48
1122 osd.49
174 osd.5
1212 osd.50
1542 osd.51
2366 osd.53
2930 osd.54
196 osd.55
2008 osd.57
1796 osd.58
1644 osd.59
12974 osd.6
208 osd.60
644 osd.61
1230 osd.62
1018 osd.63
1446 osd.64
430 osd.65
512 osd.66
1782 osd.67
96 osd.68
1054 osd.69
596 osd.7
760 osd.70
1160 osd.71
648 osd.72
638 osd.73
752 osd.74
2196 osd.75
568 osd.76
1540 osd.77
944 osd.78
400 osd.79
516 osd.8
2494 osd.80
152 osd.81
1528 osd.82
846 osd.83
1218 osd.84
676 osd.85
1500 osd.86
630 osd.87
618 osd.88
106 osd.89
282 osd.9
972 osd.90
2442 osd.91
1056 osd.92
996 osd.93
1366 osd.94
1112 osd.95
474 osd.96
990 osd.97
342 osd.98
4906 osd.99

Example log

2018-01-30 10:31:39.865054 mon.host01 mon.0 10.212.32.18:6789/0 695886 :
cluster [WRN] Health check update: 57 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:31:44.865536 mon.host01 mon.0 10.212.32.18:6789/0 695887 :
cluster [WRN] Health check update: 56 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:31:43.408679 osd.85 osd.85 10.212.32.25:6810/142783 4357 :
cluster [WRN] 56 slow requests, 2 included below; oldest blocked for >
54.329588 secs
2018-01-30 10:31:43.408683 osd.85 osd.85 10.212.32.25:6810/142783 4358 :
cluster [WRN] slow request 30.391140 seconds old, received at 2018-01-30
10:31:13.017492: osd_op(client.612334.0:119664853 3.3cca000a 3.3cca000a
(undecoded) ack+ondisk+write+known_if_redirected e5269) currently
queued_for_pg
2018-01-30 10:31:43.408686 osd.85 osd.85 10.212.32.25:6810/142783 4359 :
cluster [WRN] slow request 30.359503 seconds old, received at 2018-01-30
10:31:13.049129: osd_op(client.612334.0:119664855 3.3cca000a 3.3cca000a
(undecoded) ack+ondisk+write+known_if_redirected e5269) currently
queued_for_pg
2018-01-30 10:31:50.891971 mon.host01 mon.0 10.212.32.18:6789/0 695888 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 56 slow requests are
blocked > 32 sec)
2018-01-30 10:31:50.892042 mon.host01 mon.0 10.212.32.18:6789/0 695889 :
cluster [INF] Cluster is now healthy
2018-01-30 10:31:52.947008 mon.host01 mon.0 10.212.32.18:6789/0 695890 :
cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:31:54.125045 mon.host01 mon.0 10.212.32.18:6789/0 695891 :
cluster [INF] mon.1 10.212.32.19:6789/0
2018-01-30 10:31:54.125098 mon.host01 mon.0 10.212.32.18:6789/0 695892 :
cluster [INF] mon.2 10.212.32.20:6789/0
2018-01-30 10:31:56.146267 mon.host01 mon.0 10.212.32.18:6789/0 695893 :
cluster [WRN] overall HEALTH_WARN 1 slow requests are blocked > 32 sec
2018-01-30 10:31:59.866730 mon.host01 mon.0 10.212.32.18:6789/0 695894 :
cluster [WRN] Health check update: 3 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:32:04.867163 mon.host01 mon.0 10.212.32.18:6789/0 695896 :
cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:32:05.722529 osd.11 osd.11 10.212.32.18:6812/94831 5792 :
cluster [WRN] 3 slow requests, 3 included below; oldest blocked for >
30.713577 secs
2018-01-30 10:32:05.722536 osd.11 osd.11 10.212.32.18:6812/94831 5793 :
cluster [WRN] slow request 30.023513 seconds old, received at 2018-01-30
10:31:35.698959: osd_op(client.858110.0:109814110 3.6a6d3d03 3.6a6d3d03
(undecoded) ack+ondisk+write+known_if_redirected e5269) currently
queued_for_pg
2018-01-30 10:32:05.722539 osd.11 osd.11 10.212.32.18:6812/94831 5794 :
cluster [WRN] slow request 30.188061 seconds old, received at 2018-01-30
10:31:35.534411: osd_op(client.858110.0:109814108 3.6a6d3d03 3.6a6d3d03
(undecoded) ack+ondisk+write+known_if_redirected e5269) currently
queued_for_pg
2018-01-30 10:32:05.722542 osd.11 osd.11 10.212.32.18:6812/94831 5795 :
cluster [WRN] slow request 30.713577 seconds old, received at 2018-01-30
10:31:35.008895: osd_op(client.864319.0:1526442 20.21bs0 20.da47561b
(undecoded) ondisk+read+known_if_redirected e5269) currently queued_for_pg
2018-01-30 10:32:08.722925 osd.11 osd.11 10.212.32.18:6812/94831 5796 :
cluster [WRN] 4 slow requests, 1 included below; oldest blocked for >
33.713980 secs
2018-01-30 10:32:08.722930 osd.11 osd.11 10.212.32.18:6812/94831 5797 :
cluster [WRN] slow request 30.132129 seconds old, received at 2018-01-30
10:31:38.590746: osd_op(client.864319.0:1526720 20.21bs0 20.2af5361b
(undecoded) ondisk+read+known_if_redirected e5269) currently queued_for_pg
2018-01-30 10:32:11.723286 osd.11 osd.11 10.212.32.18:6812/94831 5798 :
cluster [WRN] 5 slow requests, 1 included below; oldest blocked for >
36.714352 secs
2018-01-30 10:32:11.723290 osd.11 osd.11 10.212.32.18:6812/94831 5799 :
cluster [WRN] slow request 30.963475 seconds old, received at 2018-01-30
10:31:40.759772: osd_op(client.834441.0:1569804 20.21bs0 20.cf3c321b
(undecoded) ondisk+read+known_if_redirected e5269) currently queued_for_pg
2018-01-30 10:32:13.723602 osd.11 osd.11 10.212.32.18:6812/94831 5800 :
cluster [WRN] 6 slow requests, 1 included below; oldest blocked for >
38.714655 secs
2018-01-30 10:32:13.723607 osd.11 osd.11 10.212.32.18:6812/94831 5801 :
cluster [WRN] slow request 30.933479 seconds old, received at 2018-01-30
10:31:42.790071: osd_op(client.612286.0:59425229 3.e2431d03 3.e2431d03
(undecoded) ack+ondisk+write+known_if_redirected e5269) currently
queued_for_pg
2018-01-30 10:32:23.857767 mon.host01 mon.0 10.212.32.18:6789/0 695897 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 6 slow requests are
blocked > 32 sec)
2018-01-30 10:32:23.857841 mon.host01 mon.0 10.212.32.18:6789/0 695898 :
cluster [INF] Cluster is now healthy
2018-01-30 10:32:25.105695 mon.host01 mon.0 10.212.32.18:6789/0 695899 :
cluster [WRN] Health check failed: 4 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:32:34.869380 mon.host01 mon.0 10.212.32.18:6789/0 695900 :
cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-01-30 10:32:30.812261 osd.97 osd.97 10.212.32.26:6822/9220 3691 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.393065 secs
2018-01-30 10:32:30.812265 osd.97 osd.97 10.212.32.26:6822/9220 3692 :
cluster [WRN] slow request 30.393065 seconds old, received at 2018-01-30
10:32:00.419168: osd_op(client.1010840.0:1005983 20.13fs0 20.4245d13f
(undecoded) ondisk+read+known_if_redirected e5269) currently queued_for_pg


Are there any non-default ceph.conf options we should set to make better
use of this cluster configuration?
So far we have only increased "rgw thread pool size" to 256.

What throughput can I expect from the above configuration? I see peaks of
about 2 GB/s reads and 100 MB/s writes.
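
To double-check which options currently differ from the built-in defaults,
I believe the admin socket can report them directly on any OSD host, e.g.:

# ceph daemon osd.0 config diff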

Thanks
Jakub
Jakub Jaszewski
2018-01-31 12:37:25 UTC
Hi,

I'm wondering why the slow requests are reported mainly while the request
is queued for processing by its PG (queued_for_pg,
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#debugging-slow-request
).
Could it be due to too low pg_num/pgp_num?

It looks like the slow requests mainly target default.rgw.buckets.data
(pool id 20), volumes (pool id 3) and default.rgw.buckets.index (pool id
14).

2018-01-31 12:06:55.899557 osd.59 osd.59 10.212.32.22:6806/4413 38 :
cluster [WRN] slow request 30.125793 seconds old, received at 2018-01-31
12:06:25.773675: osd_op(client.857003.0:126171692 3.a4fec1ad 3.a4fec1ad
(undecoded) ack+ondisk+write+known_if_redirected e5722) currently
queued_for_pg

Btw, how can I get more human-friendly client information from a log entry
like the one above?
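
One idea (not verified here) is to match the client id from the log line
against the monitor's session list, which lists client entity names
together with their addresses, e.g.:

# ceph daemon mon.host01 sessions | grep client.857003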

Current pg_num/pgp_num

pool 3 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags hashpspool
stripe_width 0 application rbd
removed_snaps [1~3]
pool 14 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 20 'default.rgw.buckets.data' erasure size 9 min_size 6 crush_rule 1
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags
hashpspool stripe_width 4224 application rgw

Usage

GLOBAL:
SIZE AVAIL RAW USED %RAW USED OBJECTS
385T 144T 241T 62.54 31023k
POOLS:
NAME ID QUOTA OBJECTS QUOTA BYTES
USED %USED MAX AVAIL OBJECTS DIRTY READ
WRITE RAW USED
volumes 3 N/A N/A
40351G 70.91 16557G 10352314 10109k 2130M
2520M 118T
default.rgw.buckets.index 14 N/A N/A
0 0 16557G 205 205 160M
27945k 0
default.rgw.buckets.data 20 N/A N/A
79190G 70.51 33115G 20865953 20376k 122M
113M 116T



# ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 4502 flags hashpspool stripe_width 0
application rbd
pool 1 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins
pg_num 1024 pgp_num 1024 last_change 4502 flags hashpspool stripe_width 0
application rbd
pool 2 'images' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 5175 flags hashpspool
stripe_width 0 application rbd
removed_snaps [1~7,14~2]
pool 3 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags hashpspool
stripe_width 0 application rbd
removed_snaps [1~3]
pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool stripe_width
0 application rgw
pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 6 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 7 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 9 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 10 'default.rgw.usage' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 11 'default.rgw.users.email' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 owner
18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 12 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 owner
18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 13 'default.rgw.users.swift' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 14 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 15 'default.rgw.buckets.data.old' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 4502
flags hashpspool stripe_width 0 application rgw
pool 16 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags
hashpspool stripe_width 0 application rgw
pool 17 'default.rgw.buckets.extra' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 18 '.rgw.buckets.extra' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 4502 flags hashpspool
stripe_width 0 application rgw
pool 20 'default.rgw.buckets.data' erasure size 9 min_size 6 crush_rule 1
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 4502 flags
hashpspool stripe_width 4224 application rgw
pool 21 'benchmark_replicated' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 4550 flags
hashpspool stripe_width 0 application rbd
removed_snaps [1~3]
pool 22 'benchmark_erasure_coded' erasure size 9 min_size 7 crush_rule 1
object_hash rjenkins pg_num 32 pgp_num 32 last_change 4552 flags hashpspool
stripe_width 24576 application rbd
removed_snaps [1~3]


# ceph df detail
GLOBAL:
SIZE AVAIL RAW USED %RAW USED OBJECTS
385T 144T 241T 62.54 31023k
POOLS:
NAME ID QUOTA OBJECTS QUOTA BYTES
USED %USED MAX AVAIL OBJECTS DIRTY READ
WRITE RAW USED
rbd 0 N/A N/A
0 0 16557G 0 0 1
134k 0
vms 1 N/A N/A
0 0 16557G 0 0 0
0 0
images 2 N/A N/A
7659M 0.05 16557G 1022 1022 51247
5668 22977M
volumes 3 N/A N/A
40351G 70.91 16557G 10352314 10109k 2130M
2520M 118T
.rgw.root 4 N/A N/A
1588 0 16557G 4 4 90
4 4764
default.rgw.control 5 N/A N/A
0 0 16557G 8 8 0
0 0
default.rgw.data.root 6 N/A N/A
93943 0 16557G 336 336 239k
6393 275k
default.rgw.gc 7 N/A N/A
0 0 16557G 32 32 1773M
5281k 0
default.rgw.log 8 N/A N/A
0 0 16557G 185 185 22404k
14936k 0
default.rgw.users.uid 9 N/A N/A
3815 0 16557G 15 15 187k
53303 11445
default.rgw.usage 10 N/A N/A
0 0 16557G 7 7 278k
556k 0
default.rgw.users.email 11 N/A N/A
58 0 16557G 3 3 0
3 174
default.rgw.users.keys 12 N/A N/A
177 0 16557G 10 10 262
22 531
default.rgw.users.swift 13 N/A N/A
40 0 16557G 3 3 0
3 120
default.rgw.buckets.index 14 N/A N/A
0 0 16557G 205 205 160M
27945k 0
default.rgw.buckets.data.old 15 N/A N/A
668G 3.88 16557G 180867 176k 707k
2318k 2004G
default.rgw.buckets.non-ec 16 N/A N/A
0 0 16557G 114 114 17960
12024 0
default.rgw.buckets.extra 17 N/A N/A
0 0 16557G 0 0 0
0 0
.rgw.buckets.extra 18 N/A N/A
0 0 16557G 0 0 0
0 0
default.rgw.buckets.data 20 N/A N/A
79190G 70.51 33115G 20865953 20376k 122M
113M 116T
benchmark_replicated 21 N/A N/A
1415G 7.88 16557G 363800 355k 1338k
1251k 4247G
benchmark_erasure_coded 22 N/A N/A
11057M 0.03 33115G 2761 2761 398
5520 16586M


Thanks
Jakub
Jakub Jaszewski
2018-01-31 14:26:38 UTC
Is it safe to increase pg_num and pgp_num from 1024 up to 2048 for the
volumes and default.rgw.buckets.data pools?
How will it impact cluster behavior? I guess rebalancing will occur and
will take a long time considering the amount of data we have on the
cluster.
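
If we do go that way, I assume it is the usual two-step change per pool,
done in small increments so that backfill stays manageable (not tried on
this cluster yet):

# ceph osd pool set volumes pg_num 2048
# ceph osd pool set volumes pgp_num 2048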

Regards
Jakub
Post by Jakub Jaszewski
[...]
Luis Periquito
2018-01-31 14:59:26 UTC
On a cursory look at the information it seems the cluster is overloaded
with requests.

Just a guess, but if you look at IO usage on those spindles they'll be
at or around 100% usage most of the time.

If that is the case then increasing pg_num and pgp_num won't help and, in
the short term, will make it worse.

Metadata pools (like default.rgw.buckets.index) really excel on an SSD
pool, even a small one. I carved out a small OSD on the journal SSDs for
those kinds of workloads.
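
Assuming some OSDs carry the "ssd" device class, the move is roughly a
dedicated CRUSH rule plus repointing the pool at it (a sketch only, adjust
names to your setup):

# ceph osd crush rule create-replicated rule-ssd default host ssd
# ceph osd pool set default.rgw.buckets.index crush_rule rule-ssd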

Post by Jakub Jaszewski
[...]
Jakub Jaszewski
2018-01-31 21:10:26 UTC
Hi Luis,

Thanks for your comment. I do see high %util for a few HDDs on each Ceph
node, but there is actually very low traffic from clients.

iostat -xd shows ongoing operations

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sda 0,00 1,60 0,00 3,00 0,00 18,40 12,27
0,00 0,00 0,00 0,00 0,00 0,00
sdb 0,00 0,20 0,60 8,00 14,40 488,10 116,86
0,00 0,56 8,00 0,00 0,56 0,48
sdc 0,00 0,00 153,80 1,80 25304,00 31,30 325,65
1,12 7,20 7,28 0,00 4,21 65,52
sdd 0,00 5,40 406,80 44,00 102275,20 3295,60
468,37 1,85 4,12 4,29 2,53 2,11 95,12
sde 0,00 0,60 3,20 12,00 51,20 2461,50 330,62
0,07 4,32 12,25 2,20 2,63 4,00
sdf 0,00 0,40 1,40 8,20 44,00 1424,90 306,02
0,01 1,50 10,29 0,00 1,50 1,44
sdg 0,00 0,60 92,80 19,00 5483,20 2998,90 151,74
0,98 8,74 10,36 0,84 7,40 82,72
sdh 0,00 0,00 154,40 1,40 25299,20 74,20 325,72
1,07 6,88 6,94 0,00 4,07 63,44
sdi 0,00 0,00 0,20 7,80 12,80 397,50 102,58
0,00 0,30 12,00 0,00 0,30 0,24
sdj 0,00 0,20 0,00 4,00 0,00 645,60 322,80
0,00 0,00 0,00 0,00 0,00 0,00
sdk 0,00 0,20 1,40 15,60 32,00 1956,50 233,94
0,02 1,08 13,14 0,00 1,08 1,84
sdl 0,00 0,40 0,60 4,00 16,80 447,00 201,65
0,02 4,35 20,00 2,00 2,78 1,28
sdm 0,00 0,00 10,00 4,40 232,00 521,80 104,69
0,08 5,89 8,48 0,00 4,89 7,04
dm-0 0,00 0,00 0,00 4,60 0,00 18,40 8,00
0,00 0,00 0,00 0,00 0,00 0,00
nvme0n1 0,00 0,00 0,00 124,80 0,00 10366,40 166,13
0,01 0,12 0,00 0,12 0,03 0,32

when ceph -s shows low client traffic

# ceph -s
cluster:
id: 1023c49f-3b10-42de-9f62-9b122db32a9a
health: HEALTH_OK

services:
mon: 3 daemons, quorum host01,host02,host03
mgr: host02(active), standbys: host01, host03
osd: 108 osds: 106 up, 106 in
rgw: 3 daemons active

data:
pools: 22 pools, 4880 pgs
objects: 31121k objects, 119 TB
usage: 241 TB used, 143 TB / 385 TB avail
pgs: 4875 active+clean
3 active+clean+scrubbing+deep
2 active+clean+scrubbing

io:
client: 17646 B/s rd, 19038 kB/s wr, 4 op/s rd, 175 op/s wr
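
To map the busiest device (sdd above, at ~95% util) back to its OSD, I
think something like this should do on a filestore node (not
double-checked):

# ceph-disk list | grep -A 2 sdd
# df -h | grep ceph-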


Are there any background tasks running and utilizing the disks? Is it the
scrubbing that is generating this load?
3 active+clean+scrubbing+deep
2 active+clean+scrubbing
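
To check whether the scrubbing PGs map to the busy disks, I guess something
like this would list them together with their acting OSDs:

# ceph pg dump pgs_brief | grep -i scrub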




Thanks
Jakub
Post by Luis Periquito
[...]
Sergey Malinin
2018-01-31 23:43:57 UTC
Deep scrub is I/O-expensive. If deep scrub is unnecessary, you can disable it per pool with "ceph osd pool set <poolname> nodeep-scrub 1".
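
You can also pause it cluster-wide with the OSD flags, or confine scrubs to
off-peak hours with the schedule options (from memory, please verify on
your version):

# ceph osd set nodeep-scrub
# ceph osd set noscrub

[osd]
osd scrub begin hour = 22
osd scrub end hour = 6
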
Post by Jakub Jaszewski
3 active+clean+scrubbing+deep
Jakub Jaszewski
2018-02-01 08:33:33 UTC
Thanks, I've temporarily disabled both scrubbing and deep-scrubbing, and
things seem to be getting better.

I just noticed high traffic generated on pool default.rgw.gc

pool default.rgw.gc id 7
client io 2162 MB/s rd, 0 B/s wr, 3023 op/s rd, 0 op/s wr


There is a lot of data written via radosgw

GLOBAL:
SIZE AVAIL RAW USED %RAW USED OBJECTS
385T 142T 243T 63.08 31384k
POOLS:
NAME ID QUOTA OBJECTS
QUOTA BYTES USED %USED MAX AVAIL OBJECTS DIRTY
READ WRITE RAW USED
volumes 3 N/A
N/A 40352G 72.19 15547G 10352564 10109k
2155M 2574M 118T
default.rgw.gc 7 N/A
N/A 0 0 15547G 32 32
1820M 5352k 0
default.rgw.buckets.data 20 N/A
N/A 80596G 72.16 31094G 21235439 20737k
123M 114M 118T


Although read stats on default.rgw.gc are high according to the ceph osd
pool stats command, I don't see much rMB/s on the HDDs on any cluster node,
although r/s does seem significant. Any idea why that is?

# iostat -xdm 5 1000

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sda 0,00 2,20 0,00 2,60 0,00 0,02 14,77
0,00 0,00 0,00 0,00 0,00 0,00
sdb 0,00 0,40 35,80 10,40 0,66 1,74 106,04
0,40 8,74 10,48 2,77 6,86 31,68
sdc 0,00 0,60 45,80 15,60 0,78 3,06 127,87
0,63 10,20 12,84 2,46 7,10 43,60
sdd 0,00 0,60 27,00 10,60 0,43 1,79 121,25
0,31 8,19 11,41 0,00 6,60 24,80
sde 0,00 1,40 36,80 18,80 0,65 3,96 169,60
0,51 9,17 12,85 1,96 5,99 33,28
sdf 0,00 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 0,00 0,00
sdg 0,00 0,40 33,40 7,00 0,52 0,93 73,26
0,41 10,16 12,10 0,91 7,37 29,76
sdh 0,00 1,40 53,60 13,60 0,94 2,62 108,32
0,70 10,45 12,61 1,94 7,42 49,84
sdi 0,00 1,00 34,40 10,80 0,61 2,14 124,82
0,38 8,44 10,37 2,30 6,57 29,68
sdj 0,00 0,00 0,00 0,00 0,00 0,00 0,00
0,00 0,00 0,00 0,00 0,00 0,00
sdk 0,00 1,00 39,20 13,20 0,69 2,66 131,13
0,60 11,45 14,31 2,97 7,21 37,76
sdl 0,00 0,80 25,40 11,20 0,45 1,92 132,86
0,29 8,00 10,49 2,36 6,54 23,92
sdm 0,00 1,00 37,20 8,00 0,72 1,46 98,82
0,44 9,73 11,08 3,50 7,86 35,52
nvme0n1 0,00 0,00 0,00 700,60 0,00 20,34 59,47
0,07 0,10 0,00 0,10 0,02 1,44
dm-0 0,00 0,00 0,00 4,80 0,00 0,02 8,00
0,00 0,00 0,00 0,00 0,00 0,00


Can I schedule the garbage collection process to run on particular days or hours?
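
As far as I can tell there is no built-in hour/day schedule for RGW GC; it
seems to be driven by the rgw gc tunables (rgw gc processor period, rgw gc
max objs, rgw gc obj min wait - names from memory, please verify). At least
the pending-GC queue can be inspected and a GC pass triggered manually
during a quiet window:

# radosgw-admin gc list --include-all | head
# radosgw-admin gc process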

Rados bench looks quite fine while garbage collection is running, I think.

# rados bench -p benchmark_replicated 20 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 20 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_175315
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 364 348 1391.88 1392 0.0638181
0.0443111
2 16 711 695 1389.83 1388 0.0166734
0.0454365
3 16 1052 1036 1381.15 1364 0.0756525
0.0458462
4 16 1401 1385 1384.82 1396 0.0800253
0.0459712
5 16 1764 1748 1398.21 1452 0.048202
0.0455071
6 16 2117 2101 1400.48 1412 0.0154943
0.0455181
7 16 2468 2452 1400.96 1404 0.0324203
0.0454904
8 16 2809 2793 1396.32 1364 0.0425057
0.0456571
9 16 3175 3159 1403.82 1464 0.0535376
0.0454215
10 16 3541 3525 1409.82 1464 0.0668655
0.0452501
11 16 3911 3895 1416.18 1480 0.0506286
0.0450723
12 16 4267 4251 1416.82 1424 0.0266732
0.0450444
13 16 4615 4599 1414.9 1392 0.101581
0.045124
14 16 4977 4961 1417.25 1448 0.0342007
0.0450346
15 16 5351 5335 1422.48 1496 0.022117
0.0449238
16 16 5691 5675 1418.57 1360 0.022683
0.0450504
17 16 6035 6019 1416.05 1376 0.0702069
0.0451103
18 16 6397 6381 1417.82 1448 0.0231964
0.0450781
19 16 6750 6734 1417.5 1412 0.0131453
0.0450462
2018-02-01 08:30:57.941618 min lat: 0.0117176 max lat: 0.794775 avg lat:
0.0451095
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
20 16 7100 7084 1416.62 1400 0.0239063
0.0451095
Total time run: 20.040338
Total writes made: 7100
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1417.14
Stddev Bandwidth: 40.6598
Max bandwidth (MB/sec): 1496
Min bandwidth (MB/sec): 1360
Average IOPS: 354
Stddev IOPS: 10
Max IOPS: 374
Min IOPS: 340
Average Latency(s): 0.0451394
Stddev Latency(s): 0.0264402
Max latency(s): 0.794775
Min latency(s): 0.0117176
Cleaning up (deleting benchmark objects)
Removed 7100 objects
Clean up completed and total clean up time :0.658175


Thanks
Jakub
Post by Sergey Malinin
Deep scrub is I/O-expensive. If deep scrub is unnecessary, you can disable
it with "ceph osd pool set <poolname> nodeep-scrub".
3 active+clean+scrubbing+deep
Jakub Jaszewski
2018-02-01 14:33:01 UTC
Permalink
Raw Message
Regarding split & merge, I have the default values:
filestore_merge_threshold = 10
filestore_split_multiple = 2

according to https://bugzilla.redhat.com/show_bug.cgi?id=1219974 the
recommended values are

filestore_merge_threshold = 40
filestore_split_multiple = 8

Is it something that I can easily change back to the defaults, or to lower
values than proposed, in case of further performance degradation?

I did tests of 4 pools: 2 replicated pools (x3) and 2 EC pools (k=6,m=3).

The pool with the lowest bandwidth has an on-disk PG directory structure like
├── 20.115s1_head
│ └── DIR_5
│ └── DIR_1
│ ├── DIR_1
│ │ ├── DIR_0
│ │ ├── DIR_1
│ │ ├── DIR_2
│ │ │ ├── DIR_0
│ │ │ ├── DIR_1
│ │ │ ├── DIR_2
│ │ │ ├── DIR_3
│ │ │ ├── DIR_4
│ │ │ ├── DIR_5
│ │ │ ├── DIR_6
│ │ │ ├── DIR_7
│ │ │ ├── DIR_8
│ │ │ ├── DIR_9
│ │ │ ├── DIR_A
│ │ │ ├── DIR_B
│ │ │ ├── DIR_C
│ │ │ ├── DIR_D
│ │ │ ├── DIR_E
│ │ │ └── DIR_F


Test results

# rados bench -p default.rgw.buckets.data 10 write
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size
4194432 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180679
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 129 113 451.975 452.014 0.0376714
0.128027
2 16 209 193 385.964 320.01 0.119609
0.138517
3 16 235 219 291.974 104.003 0.0337624
0.13731
4 16 235 219 218.981 0 -
0.13731
5 16 266 250 199.983 62.0019 0.111673
0.238424
6 16 317 301 200.649 204.006 0.0340569
0.298489
7 16 396 380 217.124 316.01 0.0379956
0.283458
8 16 444 428 213.981 192.006 0.0304383
0.274193
9 16 485 469 208.426 164.005 0.391956
0.283421
10 16 496 480 191.983 44.0013 0.104497
0.292074
11 16 497 481 174.894 4.00012 0.999985
0.293545
12 16 497 481 160.32 0 -
0.293545
13 16 497 481 147.987 0 -
0.293545
14 16 497 481 137.417 0 -
0.293545
Total time run: 14.493353
Total writes made: 497
Write size: 4194432
Object size: 4194432
Bandwidth (MB/sec): 137.171
Stddev Bandwidth: 147.001
Max bandwidth (MB/sec): 452.014
Min bandwidth (MB/sec): 0
Average IOPS: 34
Stddev IOPS: 36
Max IOPS: 113
Min IOPS: 0
Average Latency(s): 0.464281
Stddev Latency(s): 1.09388
Max latency(s): 6.3723
Min latency(s): 0.023835
Cleaning up (deleting benchmark objects)
Removed 497 objects
Clean up completed and total clean up time :10.622382
#


# rados bench -p benchmark_erasure_coded 10 write
hints = 1
Maintaining 16 concurrent writes of 4202496 bytes to objects of size
4202496 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180807
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 424 408 1635.11 1635.19 0.0490434
0.0379616
2 16 828 812 1627.03 1619.16 0.0616501
0.0388467
3 16 1258 1242 1659.06 1723.36 0.0304412
0.0384537
4 16 1659 1643 1646.03 1607.13 0.0155402
0.0387351
5 16 2053 2037 1632.61 1579.08 0.0453354
0.0390236
6 16 2455 2439 1629 1611.14 0.0485313
0.0392376
7 16 2649 2633 1507.34 777.516 0.0148972
0.0393161
8 16 2858 2842 1423.61 837.633 0.0157639
0.0449088
9 16 3245 3229 1437.75 1551.02 0.0200845
0.0444847
10 16 3629 3613 1447.85 1539 0.0654451
0.0441569
Total time run: 10.229591
Total writes made: 3630
Write size: 4202496
Object size: 4202496
Bandwidth (MB/sec): 1422.18
Stddev Bandwidth: 341.609
Max bandwidth (MB/sec): 1723.36
Min bandwidth (MB/sec): 777.516
Average IOPS: 354
Stddev IOPS: 85
Max IOPS: 430
Min IOPS: 194
Average Latency(s): 0.0448612
Stddev Latency(s): 0.0712224
Max latency(s): 1.08353
Min latency(s): 0.0134629
Cleaning up (deleting benchmark objects)
Removed 3630 objects
Clean up completed and total clean up time :2.321669
#



# rados bench -p volumes 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180651
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 336 320 1279.89 1280 0.0309006
0.0472524
2 16 653 637 1273.84 1268 0.0465151
0.0495701
3 16 956 940 1253.17 1212 0.0337327
0.0504146
4 16 1256 1240 1239.85 1200 0.0177263
0.0509145
5 16 1555 1539 1231.05 1196 0.0364991
0.0516724
6 16 1868 1852 1234.51 1252 0.0260964
0.0510236
7 16 2211 2195 1254.13 1372 0.040738
0.050847
8 16 2493 2477 1238.35 1128 0.0228582
0.0514979
9 16 2838 2822 1254.07 1380 0.0265224
0.0508641
10 16 3116 3100 1239.85 1112 0.0160151
0.0513104
Total time run: 10.192091
Total writes made: 3117
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1223.3
Stddev Bandwidth: 89.9383
Max bandwidth (MB/sec): 1380
Min bandwidth (MB/sec): 1112
Average IOPS: 305
Stddev IOPS: 22
Max IOPS: 345
Min IOPS: 278
Average Latency(s): 0.0518144
Stddev Latency(s): 0.0529575
Max latency(s): 0.663523
Min latency(s): 0.0122169
Cleaning up (deleting benchmark objects)
Removed 3117 objects
Clean up completed and total clean up time :0.212296
#



# rados bench -p benchmark_replicated 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180779
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 309 293 1171.94 1172 0.0233267
0.0508877
2 16 632 616 1231.87 1292 0.0258237
0.049612
3 16 959 943 1257.19 1308 0.0335615
0.0483464
4 16 1276 1260 1259.85 1268 0.031461
0.0504689
5 16 1643 1627 1301.44 1468 0.0274032
0.0489651
6 16 1991 1975 1316.51 1392 0.0408116
0.0483596
7 16 2328 2312 1320.98 1348 0.0242298
0.048175
8 16 2677 2661 1330.33 1396 0.097513
0.047962
9 16 3042 3026 1344.72 1460 0.0196724
0.0474078
10 16 3384 3368 1347.03 1368 0.0426199
0.0472573
Total time run: 10.482871
Total writes made: 3384
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1291.25
Stddev Bandwidth: 90.4861
Max bandwidth (MB/sec): 1468
Min bandwidth (MB/sec): 1172
Average IOPS: 322
Stddev IOPS: 22
Max IOPS: 367
Min IOPS: 293
Average Latency(s): 0.048763
Stddev Latency(s): 0.0547666
Max latency(s): 0.938211
Min latency(s): 0.0121556
Cleaning up (deleting benchmark objects)
Removed 3384 objects
Clean up completed and total clean up time :0.239684
#



Luis, why did you advise against increasing pg_num/pgp_num? I'm wondering
which option is better: increasing pg_num, or increasing
filestore_merge_threshold and filestore_split_multiple?

Thanks
Jakub


On Thu, Feb 1, 2018 at 9:38 AM, Jaroslaw Owsiewski <
Hi,
maybe "split is on the floor"?
Regards
--
Jarek
Jakub Jaszewski
2018-02-02 08:55:46 UTC
Permalink
Raw Message
Hi,

So I have changed merge & split settings to

filestore_merge_threshold = 40
filestore_split_multiple = 8

and restarted all OSDs, host by host.
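For the record, the change was applied in ceph.conf roughly like this (a
sketch), and can be checked per OSD via the admin socket after the restart:

[osd]
filestore merge threshold = 40
filestore split multiple = 8

# run on the node hosting the OSD
ceph daemon osd.0 config get filestore_merge_threshold
ceph daemon osd.0 config get filestore_split_multiple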

Let me ask a question: although the pool default.rgw.buckets.data that was
affected prior to the above change now has higher write bandwidth, it is very
random. Writes are random for the other pools (both EC and replicated types)
too; before the change, writes to the replicated pools were much more stable.
Reads from the pools look fine and stable.

Is this the result of the mentioned change? Is the PG directory structure
still updating, or ...?




# rados bench -p default.rgw.buckets.data 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size
4194432 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_19744
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 169 153 611.982 612.019 0.0547835
0.096198
2 16 314 298 595.958 580.018 0.0485904
0.103653
3 16 454 438 583.956 560.017 0.0747779
0.106501
4 16 556 540 539.96 408.012 0.111734
0.110132
5 16 624 608 486.364 272.008 0.111169
0.120221
6 16 651 635 423.302 108.003 0.0706783
0.126086
7 16 730 714 407.97 316.01 0.362838
0.153117
8 16 802 786 392.97 288.009 0.0261249
0.154728
9 16 858 842 374.194 224.007 0.0703766
0.159723
10 16 913 897 358.773 220.007 0.169544
0.173459
Total time run: 10.386646
Total writes made: 913
Write size: 4194432
Object size: 4194432
Bandwidth (MB/sec): 351.616
Stddev Bandwidth: 173.421
Max bandwidth (MB/sec): 612.019
Min bandwidth (MB/sec): 108.003
Average IOPS: 87
Stddev IOPS: 43
Max IOPS: 153
Min IOPS: 27
Average Latency(s): 0.179969
Stddev Latency(s): 0.321098
Max latency(s): 2.60469
Min latency(s): 0.0209669





# rados bench -p default.rgw.buckets.data 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 15 470 455 1818.75 1820.06 0.0462129
0.0337452
Total time run:       1.960388
Total reads made: 913
Read size: 4194432
Object size: 4194432
Bandwidth (MB/sec): 1862.95
Average IOPS: 465
Stddev IOPS: 0
Max IOPS: 455
Min IOPS: 455
Average Latency(s): 0.0335775
Max latency(s): 0.259385
Min latency(s): 0.0169049
#

# rados bench -p default.rgw.buckets.data 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 15 476 461 1843.81 1844.06 0.0272928
0.0335396
2 15 954 939 1877.83 1912.06 0.0434087
0.0332626
3 16 1436 1420 1893.16 1924.06 0.0408819
0.0330526
4 15 1925 1910 1909.84 1960.06 0.0259839
0.0328248
5 15 2408 2393 1914.25 1932.06 0.0427577
0.0328096
6 16 2891 2875 1916.52 1928.06 0.0273633
0.0327565
7 15 3377 3362 1921 1948.06 0.0294795
0.032736
8 16 3875 3859 1929.36 1988.06 0.0242193
0.03258
9 16 4351 4335 1926.53 1904.06 0.0203889
0.0326336
10 16 4855 4839 1935.46 2016.06 0.0325821
0.0324852
Total time run: 10.032589
Total reads made: 4856
Read size: 4194432
Object size: 4194432
Bandwidth (MB/sec): 1936.15
Average IOPS: 484
Stddev IOPS: 11
Max IOPS: 504
Min IOPS: 461
Average Latency(s): 0.0325118
Max latency(s): 0.250729
Min latency(s): 0.0149525
#








# rados bench -p benchmark_erasure_coded 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4202496 bytes to objects of size
4202496 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_20674
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 400 384 1538.92 1539 0.0234887
0.0401751
2 16 824 808 1618.98 1699.31 0.0215378
0.0390947
3 16 1212 1196 1597.6 1555.03 0.0169675
0.0394463
4 16 1551 1535 1537.81 1358.65 0.0605608
0.0413843
5 16 1947 1931 1547.63 1587.09 0.0183851
0.0409588
6 16 2099 2083 1391.21 609.188 0.014569
0.0401053
7 16 2174 2158 1235.4 300.586 0.019966
0.0482541
8 16 2551 2535 1269.82 1510.95 0.0226075
0.0503494
9 16 2727 2711 1207.09 705.375 0.0168164
0.0486592
10 16 3015 2999 1201.79 1154.25 0.0242785
0.0531814
Total time run: 10.038395
Total writes made: 3015
Write size: 4202496
Object size: 4202496
Bandwidth (MB/sec): 1203.73
Stddev Bandwidth: 490.656
Max bandwidth (MB/sec): 1699.31
Min bandwidth (MB/sec): 300.586
Average IOPS: 300
Stddev IOPS: 122
Max IOPS: 424
Min IOPS: 75
Average Latency(s): 0.0532326
Stddev Latency(s): 0.145877
Max latency(s): 1.92627
Min latency(s): 0.0125254
#

# rados bench -p benchmark_erasure_coded 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 15 432 417 1670.33 1671.26 0.0846208
0.0365352
2 16 856 840 1682.65 1695.3 0.0317332
0.0368908
3 16 1260 1244 1661.41 1619.16 0.0143683
0.0376721
4 16 1657 1641 1643.78 1591.1 0.0307071
0.0381582
5 16 2083 2067 1656.44 1707.33 0.0108411
0.0379383
6 15 2492 2477 1654.09 1643.2 0.0612322
0.03794
7 16 2897 2881 1649.07 1619.16 0.0646332
0.038115
Total time run: 7.352938
Total reads made: 3015
Read size: 4202496
Object size: 4202496
Bandwidth (MB/sec): 1643.36
Average IOPS: 410
Stddev IOPS: 10
Max IOPS: 426
Min IOPS: 397
Average Latency(s): 0.0383485
Max latency(s): 0.232213
Min latency(s): 0.0086436
#

# rados bench -p benchmark_erasure_coded 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 15 443 428 1714.95 1715.34 0.149024
0.0358012
2 15 871 856 1715 1715.34 0.0134417
0.0360376
3 16 1314 1298 1733.73 1771.45 0.0096009
0.0359534
4 15 1739 1724 1727.07 1707.33 0.0186676
0.0362362
5 16 2146 2130 1707.04 1627.17 0.0320369
0.0367355
6 16 2577 2561 1710.39 1727.37 0.0112118
0.036779
7 16 2998 2982 1707.06 1687.29 0.0110222
0.0367852
8 16 3477 3461 1733.61 1919.74 0.0420042
0.0362533
9 16 3897 3881 1727.99 1683.28 0.0124048
0.0363658
10 16 4320 4304 1724.7 1695.3 0.011312
0.0364624
Total time run: 10.043213
Total reads made: 4320
Read size: 4202496
Object size: 4202496
Bandwidth (MB/sec): 1723.93
Average IOPS: 430
Stddev IOPS: 19
Max IOPS: 479
Min IOPS: 406
Average Latency(s): 0.0365779
Max latency(s): 0.261305
Min latency(s): 0.00883435
#






# rados bench -p volumes 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_21005
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 324 308 1231.94 1232 0.0390295
0.049679
2 16 612 596 1191.88 1152 0.0245875
0.0527361
3 16 874 858 1143.87 1048 0.0223238
0.0553406
4 16 1144 1128 1127.86 1080 0.0715606
0.0558464
5 16 1388 1372 1097.46 976 0.0279251
0.0571859
6 16 1615 1599 1065.87 908 0.159919
0.0584934
7 16 1849 1833 1047.3 936 0.0167605
0.0601535
8 16 2034 2018 1008.87 740 0.0438302
0.0628943
9 16 2265 2249 999.427 924 0.035679
0.0632574
10 16 2499 2483 993.071 936 0.0244276
0.0640876
Total time run: 10.140100
Total writes made: 2500
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 986.184
Stddev Bandwidth: 139.753
Max bandwidth (MB/sec): 1232
Min bandwidth (MB/sec): 740
Average IOPS: 246
Stddev IOPS: 34
Max IOPS: 308
Min IOPS: 185
Average Latency(s): 0.0645551
Stddev Latency(s): 0.0796484
Max latency(s): 0.860657
Min latency(s): 0.012328
#

# rados bench -p volumes 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 15 520 505 2018.81 2020 0.12669
0.0305952
2 16 1021 1005 2009.25 2000 0.0270969
0.0310014
3 16 1527 1511 2014.08 2024 0.0567405
0.0307781
4 16 2038 2022 2021.5 2044 0.0113705
0.0307111
Total time run: 4.929016
Total reads made: 2500
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2028.8
Average IOPS: 507
Stddev IOPS: 4
Max IOPS: 511
Min IOPS: 500
Average Latency(s): 0.0308263
Max latency(s): 0.252516
Min latency(s): 0.00706628
#


# rados bench -p volumes 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 535 519 2075.65 2076 0.0126651
0.029495
2 16 1035 1019 2037.64 2000 0.0522776
0.0302652
3 16 1570 1554 2071.65 2140 0.0255347
0.0301478
4 16 2092 2076 2075.66 2088 0.0201707
0.0301096
5 15 2627 2612 2089.27 2144 0.0361379
0.0299093
6 16 3148 3132 2087.69 2080 0.0506905
0.029938
7 16 3660 3644 2081.99 2048 0.0157957
0.0299908
8 16 4178 4162 2080.71 2072 0.0105906
0.0300153
9 16 4718 4702 2089.49 2160 0.0144684
0.0299537
10 16 5257 5241 2096.11 2156 0.0157336
0.029869
Total time run: 10.028729
Total reads made: 5257
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2096.78
Average IOPS: 524
Stddev IOPS: 13
Max IOPS: 540
Min IOPS: 500
Average Latency(s): 0.0298879
Max latency(s): 0.509513
Min latency(s): 0.00704302
#





# rados bench -p benchmark_replicated 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_21089
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 218 202 807.933 808 0.0375772
0.0725827
2 15 418 403 805.897 804 0.0287712
0.0739311
3 16 585 569 758.578 664 0.0184545
0.0799763
4 16 766 750 749.915 724 0.0258367
0.078744
5 16 975 959 767.115 836 0.871661
0.0792439
6 16 1129 1113 741.915 616 0.0412471
0.0793611
7 16 1276 1260 719.916 588 0.0162774
0.0804685
8 15 1502 1487 743.415 908 0.0186694
0.0778871
9 16 1687 1671 742.583 736 0.0158112
0.0790996
10 16 1832 1816 726.316 580 1.83527
0.07856
11 15 1832 1817 660.652 4 0.120861
0.0785833
12 15 1832 1817 605.599 0 -
0.0785833
13 15 1832 1817 559.015 0 -
0.0785833
14 15 1832 1817 519.085 0 -
0.0785833
15 15 1832 1817 484.48 0 -
0.0785833
16 15 1832 1817 454.201 0 -
0.0785833
17 8 1832 1824 429.13 4.66667 8.16068
0.109725
Total time run: 17.755495
Total writes made: 1832
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 412.717
Stddev Bandwidth: 377.473
Max bandwidth (MB/sec): 908
Min bandwidth (MB/sec): 0
Average IOPS: 103
Stddev IOPS: 94
Max IOPS: 227
Min IOPS: 0
Average Latency(s): 0.146787
Stddev Latency(s): 0.789571
Max latency(s): 13.2518
Min latency(s): 0.0117266
#

# rados bench -p benchmark_replicated 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 16 496 480 1919.58 1920 0.0111955
0.0289794
2 15 979 964 1927.46 1936 0.0204766
0.0294553
3 16 1425 1409 1878.23 1780 0.0209146
0.027708
4 7 1832 1825 1824.63 1664 0.0286691
0.0330634
5 7 1832 1825 1459.73 0 -
0.0330634
Total time run: 5.593011
Total reads made: 1832
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1310.21
Average IOPS: 327
Stddev IOPS: 205
Max IOPS: 484
Min IOPS: 0
Average Latency(s): 0.0399277
Max latency(s): 2.54257
Min latency(s): 0.00632494
#

# rados bench -p benchmark_replicated 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 -
0
1 15 515 500 1999.26 2000 0.0260058
0.0289045
2 16 1008 992 1983.45 1968 0.00914651
0.0297668
3 16 1512 1496 1994.23 2016 0.0163293
0.0309669
4 16 1996 1980 1979.6 1936 0.0123961
0.0313833
5 15 2486 2471 1976.43 1964 0.0318256
0.0312294
6 16 2992 2976 1983.64 2020 0.0346031
0.0313301
7 15 3498 3483 1989.94 2028 0.0119796
0.0314029
8 16 4018 4002 2000.65 2076 0.0374133
0.0312428
9 16 4558 4542 2018.33 2160 0.024143
0.0308669
10 15 5101 5086 2034.07 2176 0.0317191
0.0307552
Total time run: 10.032364
Total reads made: 5101
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2033.82
Average IOPS: 508
Stddev IOPS: 20
Max IOPS: 544
Min IOPS: 484
Average Latency(s): 0.0307879
Max latency(s): 1.3466
Min latency(s): 0.00688148
#

Regards
Jakub
Piotr Dałek
2018-02-02 09:25:50 UTC
Permalink
Raw Message
Post by Jakub Jaszewski
Hi,
So I have changed merge & split settings to
filestore_merge_threshold = 40
filestore_split_multiple = 8
and restart all OSDs , host by host.
Let me ask a question, although the pool default.rgw.buckets.data that was
affected prior to the above change has higher write bandwidth it is very
random now. Writes are random for other pools (same for EC and replicated
types) too, before the change writes to replicated pools were much more stable.
Reads from pools look fine and stable.
Is it the result of mentioned change ? Is PG directory structure updating or
...?
The HUGE problem with filestore is that it can't handle a large number of
small objects well. Sure, if the number only grows slowly (the case with RBD
images) then it's probably not that noticeable, but with 31 million objects
that come and go at a random pace you're going to hit frequent problems with
filestore collections splitting and merging. Pre-Luminous, it happened on all
OSDs hosting a particular collection at once, and in Luminous there's
"filestore split rand factor", which according to the docs:

Description: A random factor added to the split threshold to avoid
too many filestore splits occurring at once. See
``filestore split multiple`` for details.
This can only be changed for an existing osd offline,
via ceph-objectstore-tool's apply-layout-settings command.

You may want to try the above as well.
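For what it's worth, on freshly deployed OSDs the factor can simply go into
ceph.conf (a sketch, example value only; for existing OSDs it has to be
applied offline as the doc says):

[osd]
filestore split rand factor = 20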
--
Piotr Dałek
***@corp.ovh.com
https://www.ovh.com/us/
Frédéric Nass
2018-02-02 16:00:01 UTC
Permalink
Raw Message
Hi,

Split and merge operations happen during writes only, splitting on file
creation and merging on file deletion.

As you don't see any blocked requests during reads I would guess your
issue happens during splitting. Now that you increased
filestore_merge_threshold and filestore_split_multiple, you shouldn't
expect any splitting operations to happen any soon, nor any merging
operations, unless your workload consists of writing a huge number of
files and removing them.

You should check how many files are in each of the lower directories of pool
20's PGs. This would help confirm that the blocked requests coincide with
the splitting.
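A quick way to eyeball a single PG before running the full script below could
be something like this (osd.0 and pool 20 are just example ids/paths):

# file count per subdirectory of pool 20's PG directories on osd.0
find /var/lib/ceph/osd/ceph-0/current/20.*_head -type d | while read d; do
    printf '%s %s\n' "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
done | sort -n | tail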

We now use the below script (on one of the OSD nodes) to get an average
value of the number of files in some PGs of each pool and run this
script every 5 minutes with Munin to get a graph out of the values.
This way, we can anticipate the splitting and even provoke the splitting
during off hours with rados bench (prefix) :

#!/bin/bash

# For every pool, sample up to 5 of its PG directories on this OSD node and
# report the average number of files per (non-empty) subdirectory.
for pool in $(ceph osd pool ls detail | grep -v -E 'snap|^$' | cut -d " " -f2-3 | sed -e 's/ /|/g') ; do

pool_id=$(echo $pool | cut -d "|" -f1)
pool_name=$(echo $pool | cut -d "|" -f2 | sed -e "s/'//g")
nbr_objects=$(rados df | grep "$pool_name " | awk '{print $3}')
#echo "$pool_id, $pool_name |$nbr_objects|"

# skip pools that have no PG directory on this node
ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head > /dev/null 2>&1

#echo $?
#continue

if [ $? -eq 0 ]; then
avg=$(for pg_dir in $(ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head | tail -5) ; do find $pg_dir -type d -print0 | while read -d '' -r dir; do files=`find $dir -type f -maxdepth 1|wc -l`; printf "$files in $dir\n";done | grep -v '^0 ' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' ; done | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }')
else
avg=0
fi

echo "id=$pool_id, pool=$pool_name, objects=$nbr_objects, avg=$avg"

done

With 40/8 values, subdirectory splitting will happen when the number of
files in a subdirectory goes over 5120 (16 * filestore_merge_threshold *
filestore_split_multiple = 16 * 40 * 8) and merging when the number of files
goes below 40. My assumption is that merging can also happen to intermediate
directories when they go below 40 files, but I'm not sure about that.

If you ever need to provoke splitting on a pool, you can do this with a
rados bench (don't forget the prefix so you can easily remove the files
afterwards if it turns out that some were left in the pool).
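For example, something along these lines (just a sketch, small objects to
drive the per-directory file count up, then removal by the benchmark prefix):

# write lots of small objects to push subdirectories over the split threshold
rados bench -p default.rgw.buckets.data 600 write -b 4096 --no-cleanup
# remove the benchmark objects afterwards by their prefix
rados -p default.rgw.buckets.data ls | grep '^benchmark_data_' | \
    while read obj; do rados -p default.rgw.buckets.data rm "$obj"; done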
You can also do the splitting offline if you need it:

systemctl stop ceph-osd@${osd_num}
ceph-osd -i ${osd_num} --flush-journal
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} \
    --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
    --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
    --op apply-layout-settings --pool ${pool_name}
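Once the tool has finished, the OSD is started again (sketch):

systemctl start ceph-osd@${osd_num}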

Regards,

Frederic.
Frédéric Nass
2018-02-05 08:56:48 UTC
Permalink
Raw Message
Hi,

In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops
should be logged at the default debug level:
https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320
There's also an RFE for merging to be logged as well as splitting:
https://bugzilla.redhat.com/show_bug.cgi?id=1523532
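So on 12.2.2 one could keep an eye on splitting with something like the
following (the exact log message text may differ):

grep -i split /var/log/ceph/ceph-osd.*.log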

Regards,

Frédéric.
Post by Frédéric Nass
Hi,
Split and merge operations happen during writes only, splitting on
file creation and merging on file deletion.
As you don't see any blocked requests during reads I would guess your
issue happens during splitting. Now that you increased
filestore_merge_threshold and filestore_split_multiple, you shouldn't
expect any splitting operations to happen any soon, nor any merging
operations, unless your workload consists of writing a huge number of
files and removing them.
You should check how many files are in each lower directories of pool
20's PGs. This would help to confirm that the blocked requests come
with the splitting.
We now use the below script (on one of the OSD nodes) to get an
average value of the number of files in some PGs of each pool and run
this script every 5 minutes with Munin to get a graph out of the values.
This way, we can anticipate the splitting and even provoke the
#!/bin/bash
for pool in $(ceph osd pool ls detail | grep -v -E 'snap|^$' | cut -d " " -f2-3 | sed -e 's/ /|/g') ; do
  pool_id=$(echo $pool | cut -d "|" -f1)
  pool_name=$(echo $pool | cut -d "|" -f2 | sed -e "s/'//g")
  nbr_objects=$(rados df | grep "$pool_name " | awk '{print $3}')
  #echo "$pool_id, $pool_name |$nbr_objects|"
  ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head > /dev/null 2>&1
  #echo $?
  #continue
  if [ $? -eq 0 ]; then
    avg=$(for pg_dir in $(ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head | tail -5) ; do find $pg_dir -type d -print0 | while read -d '' -r dir; do files=`find $dir -type f -maxdepth 1 | wc -l`; printf "$files in $dir\n"; done | grep -v '^0 ' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' ; done | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }')
  else
    avg=0
  fi
  echo "id=$pool_id, pool=$pool_name, objects=$nbr_objects, avg=$avg"
done
With 40/8 values, subdirectory splitting will happen when the number
of files goes over 5120 and merging when the number of files goes
below 40. My assumption is that merging can also happen to
intermediate directories when they go below 40 files, but I'm not sure
about that.
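For reference, the usual rule of thumb (my reading of the filestore options,
so take it as an assumption rather than gospel): a subdirectory is split once
it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16
files, i.e. 8 * 40 * 16 = 5120 with the values above, and a merge is considered
once a subdirectory drops below abs(filestore_merge_threshold) = 40 files.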
If you ever need to provoke splitting on a pool, you can do this with
a rados bench (don't forget the prefix so you can easily remove the
files afterwards if it turns out that some were left in the pool).
ceph-osd -i ${osd_num} --flush-journal
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num}
--journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal
--log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op
apply-layout-settings --pool ${pool_name}
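For the rados bench approach mentioned above, a rough sketch (pool name and
runtime are placeholders; rados bench names its objects
benchmark_data_<hostname>_<pid> by default, so double-check the prefix before
cleaning up):
rados bench -p ${pool_name} 600 write -b 4096 --no-cleanup
rados -p ${pool_name} cleanup --prefix benchmark_data
The small block size is only there to create lots of objects quickly and push
the per-subdirectory file count over the split threshold.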
Regards,
Frederic.
Post by Jakub Jaszewski
Hi,
So I have changed merge & split settings to
filestore_merge_threshold = 40
filestore_split_multiple = 8
and restarted all OSDs, host by host.
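For reference, the change itself is just these two options in the [osd]
section of ceph.conf on every OSD host (a sketch of what we set; the restart
is what actually applies it):
[osd]
filestore_merge_threshold = 40
filestore_split_multiple = 8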
Let me ask a question: although the pool default.rgw.buckets.data
that was affected prior to the above change now has higher write
bandwidth, it is very erratic. Writes are erratic for other pools
(both EC and replicated types) too; before the change, writes to
replicated pools were much more stable.
Reads from pools look fine and stable.
Is it the result of the mentioned change? Is the PG directory structure
still being updated, or ...?
# rados bench -p default.rgw.buckets.data 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size
4194432 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_19744
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       169       153   611.982   612.019   0.0547835    0.096198
    2      16       314       298   595.958   580.018   0.0485904    0.103653
    3      16       454       438   583.956   560.017   0.0747779    0.106501
    4      16       556       540    539.96   408.012    0.111734    0.110132
    5      16       624       608   486.364   272.008    0.111169    0.120221
    6      16       651       635   423.302   108.003   0.0706783    0.126086
    7      16       730       714    407.97    316.01    0.362838    0.153117
    8      16       802       786    392.97   288.009   0.0261249    0.154728
    9      16       858       842   374.194   224.007   0.0703766    0.159723
   10      16       913       897   358.773   220.007    0.169544    0.173459
Total time run:         10.386646
Total writes made:      913
Write size:             4194432
Object size:            4194432
Bandwidth (MB/sec):     351.616
Stddev Bandwidth:       173.421
Max bandwidth (MB/sec): 612.019
Min bandwidth (MB/sec): 108.003
Average IOPS:           87
Stddev IOPS:            43
Max IOPS:               153
Min IOPS:               27
Average Latency(s):     0.179969
Stddev Latency(s):      0.321098
Max latency(s):         2.60469
Min latency(s):         0.0209669
# rados bench -p default.rgw.buckets.data 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       470       455   1818.75   1820.06   0.0462129   0.0337452
Total time run:       1.960388
Total reads made:     913
Read size:            4194432
Object size:          4194432
Bandwidth (MB/sec):   1862.95
Average IOPS:         465
Stddev IOPS:          0
Max IOPS:             455
Min IOPS:             455
Average Latency(s):   0.0335775
Max latency(s):       0.259385
Min latency(s):       0.0169049
#
# rados bench -p default.rgw.buckets.data 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0    0         0         0         0           -    0
    1      15  476       461   1843.81   1844.06   0.0272928  0.0335396
    2      15  954       939   1877.83   1912.06   0.0434087  0.0332626
    3      16 1436      1420   1893.16   1924.06   0.0408819  0.0330526
    4      15 1925      1910   1909.84   1960.06   0.0259839  0.0328248
    5      15 2408      2393   1914.25   1932.06   0.0427577  0.0328096
    6      16 2891      2875   1916.52   1928.06   0.0273633  0.0327565
    7      15 3377      3362      1921   1948.06   0.0294795 0.032736
    8      16 3875      3859   1929.36   1988.06   0.0242193  0.03258
    9      16 4351      4335   1926.53   1904.06   0.0203889  0.0326336
   10      16 4855      4839   1935.46   2016.06   0.0325821  0.0324852
Total time run:    10.032589
Total reads made:    4856
Read size:   4194432
Object size:   4194432
Bandwidth (MB/sec):   1936.15
Average IOPS:    484
Stddev IOPS:   11
Max IOPS:    504
Min IOPS:    461
Average Latency(s):   0.0325118
Max latency(s):    0.250729
Min latency(s):    0.0149525
#
# rados bench -p benchmark_erasure_coded 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4202496 bytes to objects of size
4202496 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_20674
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       400       384   1538.92      1539   0.0234887   0.0401751
    2      16       824       808   1618.98   1699.31   0.0215378   0.0390947
    3      16      1212      1196    1597.6   1555.03   0.0169675   0.0394463
    4      16      1551      1535   1537.81   1358.65   0.0605608   0.0413843
    5      16      1947      1931   1547.63   1587.09   0.0183851   0.0409588
    6      16      2099      2083   1391.21   609.188    0.014569   0.0401053
    7      16      2174      2158    1235.4   300.586    0.019966   0.0482541
    8      16      2551      2535   1269.82   1510.95   0.0226075   0.0503494
    9      16      2727      2711   1207.09   705.375   0.0168164   0.0486592
   10      16      3015      2999   1201.79   1154.25   0.0242785   0.0531814
Total time run:         10.038395
Total writes made:      3015
Write size:             4202496
Object size:            4202496
Bandwidth (MB/sec):     1203.73
Stddev Bandwidth:       490.656
Max bandwidth (MB/sec): 1699.31
Min bandwidth (MB/sec): 300.586
Average IOPS:           300
Stddev IOPS:            122
Max IOPS:               424
Min IOPS:               75
Average Latency(s):     0.0532326
Stddev Latency(s):      0.145877
Max latency(s):         1.92627
Min latency(s):         0.0125254
#
# rados bench -p benchmark_erasure_coded 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       432       417   1670.33   1671.26   0.0846208   0.0365352
    2      16       856       840   1682.65    1695.3   0.0317332   0.0368908
    3      16      1260      1244   1661.41   1619.16   0.0143683   0.0376721
    4      16      1657      1641   1643.78    1591.1   0.0307071   0.0381582
    5      16      2083      2067   1656.44   1707.33   0.0108411   0.0379383
    6      15      2492      2477   1654.09    1643.2   0.0612322     0.03794
    7      16      2897      2881   1649.07   1619.16   0.0646332    0.038115
Total time run:       7.352938
Total reads made:     3015
Read size:            4202496
Object size:          4202496
Bandwidth (MB/sec):   1643.36
Average IOPS:         410
Stddev IOPS:          10
Max IOPS:             426
Min IOPS:             397
Average Latency(s):   0.0383485
Max latency(s):       0.232213
Min latency(s):       0.0086436
#
# rados bench -p benchmark_erasure_coded 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15       443       428   1714.95   1715.34    0.149024   0.0358012
    2      15       871       856      1715   1715.34   0.0134417   0.0360376
    3      16      1314      1298   1733.73   1771.45   0.0096009   0.0359534
    4      15      1739      1724   1727.07   1707.33   0.0186676   0.0362362
    5      16      2146      2130   1707.04   1627.17   0.0320369   0.0367355
    6      16      2577      2561   1710.39   1727.37   0.0112118    0.036779
    7      16      2998      2982   1707.06   1687.29   0.0110222   0.0367852
    8      16      3477      3461   1733.61   1919.74   0.0420042   0.0362533
    9      16      3897      3881   1727.99   1683.28   0.0124048   0.0363658
   10      16      4320      4304    1724.7    1695.3    0.011312   0.0364624
Total time run:       10.043213
Total reads made:     4320
Read size:            4202496
Object size:          4202496
Bandwidth (MB/sec):   1723.93
Average IOPS:         430
Stddev IOPS:          19
Max IOPS:             479
Min IOPS:             406
Average Latency(s):   0.0365779
Max latency(s):       0.261305
Min latency(s):       0.00883435
#
# rados bench -p volumes 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_21005
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0    0         0         0           -           0
    1      16       324  308   1231.94      1232   0.0390295    0.049679
    2      16       612  596   1191.88      1152   0.0245875   0.0527361
    3      16       874  858   1143.87      1048   0.0223238   0.0553406
    4      16      1144 1128   1127.86      1080   0.0715606   0.0558464
    5      16      1388 1372   1097.46       976   0.0279251   0.0571859
    6      16      1615 1599   1065.87       908    0.159919   0.0584934
    7      16      1849 1833    1047.3       936   0.0167605   0.0601535
    8      16      2034 2018   1008.87       740   0.0438302   0.0628943
    9      16      2265 2249   999.427       924    0.035679   0.0632574
   10      16      2499 2483   993.071       936   0.0244276   0.0640876
Total time run:  10.140100
Total writes made:      2500
Write size:  4194304
Object size: 4194304
Bandwidth (MB/sec):  986.184
Stddev Bandwidth:  139.753
Max bandwidth (MB/sec): 1232
Min bandwidth (MB/sec): 740
Average IOPS:           246
Stddev IOPS:            34
Max IOPS:               308
Min IOPS:               185
Average Latency(s):  0.0645551
Stddev Latency(s): 0.0796484
Max latency(s):  0.860657
Min latency(s):  0.012328
#
# rados bench -p volumes 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0    0         0         0           -           0
    1      15       520  505   2018.81      2020     0.12669   0.0305952
    2      16      1021 1005   2009.25      2000   0.0270969   0.0310014
    3      16      1527 1511   2014.08      2024   0.0567405   0.0307781
    4      16      2038 2022    2021.5      2044   0.0113705   0.0307111
Total time run:  4.929016
Total reads made:     2500
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2028.8
Average IOPS:         507
Stddev IOPS:          4
Max IOPS:             511
Min IOPS:             500
Average Latency(s):  0.0308263
Max latency(s):  0.252516
Min latency(s):  0.00706628
#
# rados bench -p volumes 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0    0         0         0           -           0
    1      16       535  519   2075.65      2076   0.0126651    0.029495
    2      16      1035 1019   2037.64      2000   0.0522776   0.0302652
    3      16      1570 1554   2071.65      2140   0.0255347   0.0301478
    4      16      2092 2076   2075.66      2088   0.0201707   0.0301096
    5      15      2627 2612   2089.27      2144   0.0361379   0.0299093
    6      16      3148 3132   2087.69      2080   0.0506905    0.029938
    7      16      3660 3644   2081.99      2048   0.0157957   0.0299908
    8      16      4178 4162   2080.71      2072   0.0105906   0.0300153
    9      16      4718 4702   2089.49      2160   0.0144684   0.0299537
   10      16      5257 5241   2096.11      2156   0.0157336    0.029869
Total time run:  10.028729
Total reads made:     5257
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2096.78
Average IOPS:         524
Stddev IOPS:          13
Max IOPS:             540
Min IOPS:             500
Average Latency(s):  0.0298879
Max latency(s):  0.509513
Min latency(s):  0.00704302
#
# rados bench -p benchmark_replicated 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_21089
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0    0           -           0
    1      16       218       202   807.933  808   0.0375772   0.0725827
    2      15       418       403   805.897  804   0.0287712   0.0739311
    3      16       585       569   758.578  664   0.0184545   0.0799763
    4      16       766       750   749.915  724   0.0258367    0.078744
    5      16       975       959   767.115  836    0.871661   0.0792439
    6      16      1129      1113   741.915  616   0.0412471   0.0793611
    7      16      1276      1260   719.916  588   0.0162774   0.0804685
    8      15      1502      1487   743.415  908   0.0186694   0.0778871
    9      16      1687      1671   742.583  736   0.0158112   0.0790996
   10      16      1832      1816   726.316  580     1.83527     0.07856
   11      15      1832      1817   660.652    4    0.120861   0.0785833
   12      15      1832      1817   605.599    0           -   0.0785833
   13      15      1832      1817   559.015    0           -   0.0785833
   14      15      1832      1817   519.085    0           -   0.0785833
   15      15      1832      1817    484.48    0           -   0.0785833
   16      15      1832      1817   454.201    0           -   0.0785833
   17       8      1832      1824    429.13   4.66667     8.16068    0.109725
Total time run:         17.755495
Total writes made:      1832
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     412.717
Stddev Bandwidth:       377.473
Max bandwidth (MB/sec): 908
Min bandwidth (MB/sec): 0
Average IOPS:           103
Stddev IOPS:            94
Max IOPS:               227
Min IOPS:               0
Average Latency(s):     0.146787
Stddev Latency(s):      0.789571
Max latency(s):         13.2518
Min latency(s):         0.0117266
#
# rados bench -p benchmark_replicated 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0    0           -           0
    1      16       496       480   1919.58 1920   0.0111955   0.0289794
    2      15       979       964   1927.46 1936   0.0204766   0.0294553
    3      16      1425      1409   1878.23 1780   0.0209146    0.027708
    4       7      1832      1825   1824.63 1664   0.0286691   0.0330634
    5       7      1832      1825   1459.73    0           -   0.0330634
Total time run:       5.593011
Total reads made:     1832
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1310.21
Average IOPS:         327
Stddev IOPS:          205
Max IOPS:             484
Min IOPS:             0
Average Latency(s):   0.0399277
Max latency(s):       2.54257
Min latency(s):       0.00632494
#
# rados bench -p benchmark_replicated 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0    0           -           0
    1      15       515       500   1999.26 2000   0.0260058   0.0289045
    2      16      1008       992   1983.45 1968  0.00914651   0.0297668
    3      16      1512      1496   1994.23 2016   0.0163293   0.0309669
    4      16      1996      1980    1979.6 1936   0.0123961   0.0313833
    5      15      2486      2471   1976.43 1964   0.0318256   0.0312294
    6      16      2992      2976   1983.64 2020   0.0346031   0.0313301
    7      15      3498      3483   1989.94 2028   0.0119796   0.0314029
    8      16      4018      4002   2000.65 2076   0.0374133   0.0312428
    9      16      4558      4542   2018.33 2160    0.024143   0.0308669
   10      15      5101      5086   2034.07 2176   0.0317191   0.0307552
Total time run:       10.032364
Total reads made:     5101
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2033.82
Average IOPS:         508
Stddev IOPS:          20
Max IOPS:             544
Min IOPS:             484
Average Latency(s):   0.0307879
Max latency(s):       1.3466
Min latency(s):       0.00688148
#
Regards
Jakub
On Thu, Feb 1, 2018 at 3:33 PM, Jakub Jaszewski
[...]
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Frédéric Nass
2018-02-05 17:37:45 UTC
Permalink
Raw Message
Hi Jakub,
Hi Frederic,
Many thanks for your contribution to the topic!
I've just set logging level 20 for filestore via
ceph tell osd.0 config set debug_filestore 20
but so far found nothing by keyword 'split' in /var/log/ceph/ceph-osd.0.log
So, if you're running ceph > 12.2.1, that means splitting is not
happening. Did you check during writes? Did you check the other OSDs' logs?

Actually, splitting should not happen now that you've increased
filestore_merge_threshold and filestore_split_multiple values.
I've also run your script across the cluster nodes, results as follows:
id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481
Looks like there is nothing to be handled by splitting, am I right? But
what about merging? The average is less than 40,
so should the directory structure be reduced now?
It should, I guess. But then you'd see blocked requests on every object
deletion. If you do, you might want to set filestore_merge_threshold
to -40 (a negative value) so merging does not happen anymore.
Splitting would still happen over 5120 files per subdirectory.
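A minimal sketch of what that would look like in the [osd] section of
ceph.conf (my assumption of the syntax; the absolute value should still drive
the split formula, hence the unchanged 5120-file trigger):
[osd]
filestore_merge_threshold = -40
filestore_split_multiple = 8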
    "
​​
filestore_merge_threshold": "40",
"filestore_split_multiple": "8",
"filestore_split_rand_factor": "20",
M
​ay I ask for the link to documentation where I can read more about
OSD underlying directory structure?
I'm not aware of any related documentation.

Do you still observe slow or blocked requests now that you've increased
filestore_merge_threshold and filestore_split_multiple?

Regards,

Frédéric.
And just noticed log entries in /var/log/ceph/ceph-osd.0.log:
2018-02-05 11:22:03.346400 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27 existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27 existing_state=STATE_STANDBY
​
Many thanks!
Jakub
On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass
Hi,
In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops
https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320
<https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320>
https://bugzilla.redhat.com/show_bug.cgi?id=1523532
<https://bugzilla.redhat.com/show_bug.cgi?id=1523532>
Regards,
Frédéric.
[...]
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
Jakub Jaszewski
2018-02-06 15:03:22 UTC
Permalink
Raw Message
​Hi Frederic,

I've not enabled debug-level logging on all OSDs, just on the one for the test;
I need to double-check that.
But it looks like merging is ongoing on a few OSDs, or some OSDs are faulty; I
will dig into that tomorrow.
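As a side note, to enable it everywhere next time I will probably use
something like (assuming the tell syntax I used on osd.0 also accepts the
osd.* wildcard):
# ceph tell osd.* config set debug_filestore 20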
Write bandwidth is very erratic:

# rados bench -p default.rgw.buckets.data 120 write
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size
4194432 for up to 120 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_59104
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       155       139    555.93   556.017   0.0750027     0.10687
    2      16       264       248   495.936   436.013    0.154185    0.118693
    3      16       330       314   418.616   264.008    0.118476    0.142667
    4      16       415       399   398.953    340.01   0.0873379     0.15102
    5      16       483       467   373.557   272.008    0.750453    0.159819
    6      16       532       516   343.962   196.006   0.0298334    0.171218
    7      16       617       601   343.391    340.01    0.192698    0.177288
    8      16       700       684   341.963    332.01   0.0281355    0.171277
    9      16       762       746   331.521   248.008   0.0962037    0.163734
   10      16       804       788   315.167   168.005     1.40356    0.196298
   11      16       897       881    320.33   372.011   0.0369085     0.19496
   12      16       985       969   322.966   352.011   0.0290563    0.193986
   13      15      1106      1091   335.657   488.015   0.0617642    0.188703
   14      16      1166      1150   328.537   236.007   0.0401884    0.186206
   15      16      1251      1235   329.299    340.01    0.171256    0.190974
   16      16      1339      1323   330.716   352.011    0.024222    0.189901
   17      16      1417      1401   329.613    312.01   0.0289473    0.186562
   18      16      1465      1449   321.967   192.006    0.028123    0.189153
   19      16      1522      1506    317.02   228.007    0.265448    0.188288
2018-02-06 13:43:21.412512 min lat: 0.0204657 max lat: 3.61509 avg lat: 0.18918
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20      16      1564      1548   309.568   168.005   0.0327581     0.18918
   21      16      1636      1620    308.54   288.009   0.0715159    0.187381
   22      16      1673      1657   301.242   148.005     1.57285    0.191596
   23      16      1762      1746   303.621   356.011     6.00352    0.206217
   24      16      1885      1869   311.468   492.015   0.0298435    0.203874
   25      16      2010      1994   319.008   500.015   0.0258761    0.199652
   26      16      2116      2100   323.044   424.013   0.0533319     0.19631
   27      16      2201      2185    323.67    340.01    0.134796    0.195953
   28      16      2257      2241    320.11   224.007    0.473629    0.196464
   29      16      2333      2317   319.554   304.009   0.0362741    0.198054
   30      16      2371      2355   313.968   152.005    0.438141    0.200265
   31      16      2459      2443   315.194   352.011   0.0610629    0.200858
   32      16      2525      2509   313.593   264.008   0.0234799    0.201008
   33      16      2612      2596   314.635   348.011    0.072019    0.199094
   34      16      2682      2666   313.615   280.009     0.10062    0.197586
   35      16      2757      2741   313.225   300.009   0.0552581    0.196981
   36      16      2849      2833   314.746   368.011    0.257323     0.19565
   37      16      2891      2875   310.779   168.005   0.0918386     0.19556
   38      16      2946      2930    308.39   220.007   0.0276621    0.195792
   39      16      2975      2959   303.456   116.004   0.0588971     0.19952
2018-02-06 13:43:41.415107 min lat: 0.0204657 max lat: 7.9873 avg lat: 0.198749
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   40      16      3060      3044   304.369    340.01   0.0217136    0.198749
   41      16      3098      3082   300.652   152.005   0.0717398    0.199052
   42      16      3141      3125   297.589   172.005   0.0257422    0.201899
   43      15      3241      3226   300.063   404.012   0.0733869    0.209446
   44      16      3332      3316   301.424   360.011   0.0327249    0.206686
   45      16      3430      3414   303.436   392.012   0.0413156    0.203727
   46      16      3534      3518   305.882   416.013    0.033638    0.202182
   47      16      3602      3586   305.161   272.008   0.0453557    0.200028
   48      16      3663      3647   303.886   244.007   0.0779019    0.199777
   49      16      3736      3720   303.643   292.009   0.0285231    0.206274
   50      16      3849      3833   306.609   452.014   0.0537071    0.208127
   51      16      3909      3893   305.303   240.007   0.0366709    0.207793
   52      16      3972      3956   304.277   252.008   0.0289131    0.207989
   53      16      4048      4032   304.272   304.009   0.0348617    0.207844
   54      16      4114      4098   303.525   264.008   0.0799526     0.20701
   55      16      4191      4175   303.606   308.009    0.034915    0.206882
   56      16      4313      4297   306.898   488.015    0.108777    0.205869
   57      16      4449      4433   311.057   544.017   0.0862092    0.205232
   58      16      4548      4532    312.52   396.012    0.135814    0.203753
   59      16      4678      4662   316.036   520.016   0.0307446    0.202156
2018-02-06 13:44:01.417687 min lat: 0.0204657 max lat: 13.0888 avg lat: 0.201848
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   60      16      4734      4718   314.502   224.007    0.223548    0.201848
   61      16      4783      4767   312.559   196.006    0.124237    0.203641
   62      16      4838      4822   311.066   220.007    0.270943    0.204074
   63      16      4929      4913   311.905   364.011    0.123486    0.204716
   64      16      5063      5047   315.406   536.016    0.101675    0.202283
   65      16      5124      5108   314.307   244.007   0.0504368    0.201181
   66      16      5243      5227   316.756   476.015    0.235348    0.201481
   67      16      5339      5323   317.759   384.012   0.0478478    0.200486
   68      16      5414      5398   317.497   300.009    0.258401    0.200194
   69      16      5461      5445    315.62   188.006    0.112516     0.20022
   70      16      5523      5507   314.654   248.008   0.0927405    0.201534
   71      16      5582      5566   313.546   236.007    0.333586    0.202304
   72      16      5690      5674    315.19   432.013   0.0706812    0.201498
   73      16      5780      5764   315.804   360.011   0.0306772    0.200354
   74      16      5850      5834    315.32   280.009   0.0261027    0.200627
   75      16      5905      5889   314.048   220.007    0.101282     0.20075
   76      16      6013      5997     315.6   432.013    0.161956    0.202154
   77      16      6130      6114   317.578   468.014    0.042322     0.20092
   78      16      6232      6216   318.737   408.012   0.0912166    0.199238
   79      16      6260      6244    316.12   112.003   0.0406156    0.198971
2018-02-06 13:44:21.420343 min lat: 0.0204657 max lat: 13.0888 avg lat: 0.200443
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   80      16      6304      6288   314.368   176.005   0.0409417    0.200443
   81      16      6375      6359   313.993   284.009    0.150617     0.19986
   82      16      6447      6431   313.676   288.009   0.0730616    0.200072
   83      16      6513      6497   313.077   264.008   0.0565517    0.202155
   84      16      6594      6578   313.206    324.01   0.0273074    0.202981
   85      16      6732      6716   316.015   552.017    0.134015    0.202301
   86      16      6878      6862   319.131   584.018    0.157684    0.200182
   87      16      6971      6955   319.738   372.011    0.327923    0.199431
   88      16      7018      7002   318.241   188.006   0.0317034    0.198737
   89      16      7056      7040   316.373   152.005   0.0874119    0.199579
   90      16      7095      7079    314.59   156.005    0.027061    0.198883
   91      16      7150      7134   313.551   220.007    0.247791    0.199385
   92      16      7201      7185    312.36   204.006    0.131592    0.202395
   93      16      7216      7200   309.646   60.0018   0.0678182    0.202927
   94      16      7276      7260   308.905   240.007    0.040735    0.204362
   95      16      7304      7288   306.832   112.003     0.04629    0.204849
   96      16      7402      7386   307.719   392.012    0.085536    0.206601
   97      16      7524      7508   309.577   488.015    0.904426    0.205316
   98      16      7682      7666   312.866   632.019   0.0580214    0.204241
   99      16      7840      7824   316.089   632.019     0.13412    0.202289
2018-02-06 13:44:41.423052 min lat: 0.0204657 max lat: 13.0888 avg lat: 0.200948
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
  100      16      7949      7933   317.288   436.013    0.025568    0.200948
  101      16      7996      7980   316.007   188.006    0.328559    0.201496
  102      16      8020      8004    313.85   96.0029   0.0524293     0.20214
  103      16      8044      8028   311.735   96.0029   0.0628428     0.20236
  104      16      8062      8046    309.43   72.0022   0.0320078    0.202743
  105      16      8088      8072   307.473   104.003   0.0506497    0.204222
  106      16      8127      8111   306.044   156.005   0.0436112     0.20792
  107      16      8226      8210   306.885   396.012   0.0452614    0.207666
  108      16      8296      8280   306.635   280.009   0.0500199    0.207578
  109      16      8367      8351   306.428   284.009   0.0364288    0.207779
  110      16      8475      8459   307.569   432.013    0.140141    0.206752
  111      16      8559      8543   307.825    336.01   0.0250007    0.206032
  112      16      8696      8680   309.968   548.017   0.0249451    0.205808
  113      16      8804      8788   311.048   432.013     0.49485    0.205075
  114      16      8917      8901   312.284   452.014   0.0731665    0.204527
  115      16      9047      9031    314.09   520.016   0.0365535    0.203553
  116      16      9130      9114   314.244    332.01   0.0451302    0.203367
  117      16      9259      9243   315.968   516.016   0.0610521    0.202253
  118      16      9373      9357   317.154   456.014   0.0282051    0.201613
  119      16      9465      9449   317.581   368.011   0.0299575    0.200845
2018-02-06 13:45:01.425689 min lat: 0.0204657 max lat: 13.0888 avg lat: 0.200298
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
  120      16      9545      9529   317.601    320.01   0.0264095    0.200298
  121       7      9546      9539   315.307   40.0012    0.659852    0.201486
  122       7      9546      9539   312.722         0           -    0.201486
Total time run: 122.442935
Total writes made: 9546
Write size: 4194432
Object size: 4194432
Bandwidth (MB/sec): 311.861
Stddev Bandwidth: 133.889
Max bandwidth (MB/sec): 632.019
Min bandwidth (MB/sec): 0
Average IOPS: 77
Stddev IOPS: 33
Max IOPS: 158
Min IOPS: 0
Average Latency(s): 0.203467
Stddev Latency(s): 0.531046
Max latency(s): 13.0888
Min latency(s): 0.0204657
Cleaning up (deleting benchmark objects)
Removed 9546 objects
Clean up completed and total clean up time :308.444893
#

Many slow requests appeared in the log once the cleanup started. The affected
OSDs, osd.53 and osd.91, are located on different nodes.

2018-02-06 13:42:49.881647 mon.sg01-06 mon.0 10.212.32.18:6789/0 3574 :
cluster [WRN] overall HEALTH_WARN noscrub,nodeep-scrub flag(s) set
2018-02-06 13:44:49.777466 mon.sg01-06 mon.0 10.212.32.18:6789/0 3577 :
cluster [INF] mon.1 10.212.32.19:6789/0
2018-02-06 13:44:49.777627 mon.sg01-06 mon.0 10.212.32.18:6789/0 3578 :
cluster [INF] mon.2 10.212.32.20:6789/0
2018-02-06 13:45:22.221472 mon.sg01-06 mon.0 10.212.32.18:6789/0 3579 :
cluster [WRN] Health check failed: 4 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:45:26.227333 mon.sg01-06 mon.0 10.212.32.18:6789/0 3580 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 4 slow requests are
blocked > 32 sec)
2018-02-06 13:45:40.325885 mon.sg01-06 mon.0 10.212.32.18:6789/0 3582 :
cluster [WRN] Health check failed: 26 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:45:46.360383 mon.sg01-06 mon.0 10.212.32.18:6789/0 3583 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 26 slow requests are
blocked > 32 sec)
2018-02-06 13:46:08.518861 mon.sg01-06 mon.0 10.212.32.18:6789/0 3584 :
cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:13.806928 mon.sg01-06 mon.0 10.212.32.18:6789/0 3585 :
cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:18.807371 mon.sg01-06 mon.0 10.212.32.18:6789/0 3586 :
cluster [WRN] Health check update: 20 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:23.807835 mon.sg01-06 mon.0 10.212.32.18:6789/0 3587 :
cluster [WRN] Health check update: 15 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:19.082992 osd.53 osd.53 10.212.32.22:6802/50082 167 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.071323 secs
2018-02-06 13:46:19.082999 osd.53 osd.53 10.212.32.22:6802/50082 168 :
cluster [WRN] slow request 30.071323 seconds old, received at 2018-02-06
13:45:49.011626: osd_op(client.10075.0:25660466 3.50964d18
3:18b2690a:::rbd_data.bcc718ab7e41.000000000001018d:head [write
2703360~8192] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently journaled_completion_queued
2018-02-06 13:46:21.083367 osd.53 osd.53 10.212.32.22:6802/50082 169 :
cluster [WRN] 2 slow requests, 1 included below; oldest blocked for >
32.071703 secs
2018-02-06 13:46:21.083390 osd.53 osd.53 10.212.32.22:6802/50082 170 :
cluster [WRN] slow request 30.141592 seconds old, received at 2018-02-06
13:45:50.941737: osd_op(client.841436.0:77497677 3.52e3e8ff
3:ff17c74a:::rbd_data.cd25332916c98.0000000000012b0b:head [write
245760~3612672] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently commit_sent
2018-02-06 13:46:28.808214 mon.sg01-06 mon.0 10.212.32.18:6789/0 3589 :
cluster [WRN] Health check update: 16 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:33.808664 mon.sg01-06 mon.0 10.212.32.18:6789/0 3590 :
cluster [WRN] Health check update: 17 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:38.809093 mon.sg01-06 mon.0 10.212.32.18:6789/0 3591 :
cluster [WRN] Health check update: 1 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:43.809537 mon.sg01-06 mon.0 10.212.32.18:6789/0 3592 :
cluster [WRN] Health check update: 2 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:44.760767 mon.sg01-06 mon.0 10.212.32.18:6789/0 3593 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are
blocked > 32 sec)
2018-02-06 13:46:52.819718 mon.sg01-06 mon.0 10.212.32.18:6789/0 3594 :
cluster [WRN] Health check failed: 70 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:46:59.674390 mon.sg01-06 mon.0 10.212.32.18:6789/0 3595 :
cluster [WRN] Health check update: 105 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-06 13:47:01.444356 osd.91 osd.91 10.212.32.25:6818/81813 5 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.028512 secs
2018-02-06 13:47:01.444365 osd.91 osd.91 10.212.32.25:6818/81813 6 :
cluster [WRN] slow request 30.028512 seconds old, received at 2018-02-06
13:46:31.415754: osd_op(client.757449.0:155792306 3.92bf145a
3:5a28fd49:::rbd_data.8e125f170c6e.00000000000318b3:head [write
176128~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently sub_op_commit_rec from 72
2018-02-06 13:47:02.444612 osd.91 osd.91 10.212.32.25:6818/81813 7 :
cluster [WRN] 12 slow requests, 5 included below; oldest blocked for >
31.028754 secs
2018-02-06 13:47:02.444637 osd.91 osd.91 10.212.32.25:6818/81813 12 :
cluster [WRN] slow request 30.121723 seconds old, received at 2018-02-06
13:46:32.322785: osd_op(client.757449.0:155792349 3.92bf145a
3:5a28fd49:::rbd_data.8e125f170c6e.00000000000318b3:head [write
217088~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently sub_op_commit_rec from 72
2018-02-06 13:47:01.089688 osd.53 osd.53 10.212.32.22:6802/50082 171 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.546028 secs
2018-02-06 13:47:01.089695 osd.53 osd.53 10.212.32.22:6802/50082 172 :
cluster [WRN] slow request 30.546028 seconds old, received at 2018-02-06
13:46:30.543617: osd_op(client.48366.0:29692284 3.859d4e2b 3.859d4e2b
(undecoded) ack+ondisk+write+known_if_redirected e5954) currently
queued_for_pg
2018-02-06 13:47:02.089919 osd.53 osd.53 10.212.32.22:6802/50082 173 :
cluster [WRN] 9 slow requests, 5 included below; oldest blocked for >
31.546245 secs
2018-02-06 13:47:02.089926 osd.53 osd.53 10.212.32.22:6802/50082 174 :
cluster [WRN] slow request 30.414557 seconds old, received at 2018-02-06
13:46:31.675305: osd_repop(client.757449.0:155792317 3.5a e5954/5842)
currently queued_for_pg
2018-02-06 13:47:04.445182 osd.91 osd.91 10.212.32.25:6818/81813 24 :
cluster [WRN] slow request 30.496769 seconds old, received at 2018-02-06
13:46:33.948299: osd_op(client.757449.0:155792384 3.92bf145a
3:5a28fd49:::rbd_data.8e125f170c6e.00000000000318b3:head [write
262144~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently sub_op_commit_rec from 72
2018-02-06 13:50:11.133054 osd.53 osd.53 10.212.32.22:6802/50082 864 :
cluster [WRN] slow request 120.604868 seconds old, received at 2018-02-06
13:48:10.528083: osd_op(client.840950.0:169909382 3.cf2d18ff 3.cf2d18ff
(undecoded) ack+ondisk+write+known_if_redirected e5954) currently
queued_for_pg
2018-02-06 13:50:11.133058 osd.53 osd.53 10.212.32.22:6802/50082 865 :
cluster [WRN] slow request 61.985051 seconds old, received at 2018-02-06
13:49:09.147900: osd_op(client.749595.0:78222152 3.4347d828 3.4347d828
(undecoded) ack+ondisk+write+known_if_redirected e5954) currently
queued_for_pg
2018-02-06 13:50:11.133061 osd.53 osd.53 10.212.32.22:6802/50082 866 :
cluster [WRN] slow request 31.256269 seconds old, received at 2018-02-06
13:49:39.876682: osd_op(client.749595.0:78223068 3.4347d828 3.4347d828
(undecoded) ack+ondisk+write+known_if_redirected e5954) currently
queued_for_pg
2018-02-06 13:50:12.133218 osd.53 osd.53 10.212.32.22:6802/50082 867 :
cluster [WRN] 48 slow requests, 1 included below; oldest blocked for >
168.672302 secs
2018-02-06 13:50:12.133223 osd.53 osd.53 10.212.32.22:6802/50082 868 :
cluster [WRN] slow request 34.801266 seconds old, received at 2018-02-06
13:49:37.331905: osd_op(client.48192.0:97243662 3.7163d828 3.7163d828
(undecoded) ack+ondisk+write+known_if_redirected e5954) currently
queued_for_pg
2018-02-06 13:50:18.139428 mon.sg01-06 mon.0 10.212.32.18:6789/0 3639 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 247 slow requests
are blocked > 32 sec)
2018-02-06 13:50:49.882654 mon.sg01-06 mon.0 10.212.32.18:6789/0 3641 :
cluster [WRN] overall HEALTH_WARN noscrub,nodeep-scrub flag(s) set
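To see what the blocked ops on those two OSDs were actually doing, I suppose
the admin socket can help; a sketch, run on the nodes hosting them:
# ceph daemon osd.53 dump_historic_ops
# ceph daemon osd.91 dump_ops_in_flight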

Testing with dd

# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-53/dd_test bs=4M count=1000
oflag=direct
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB, 3,9 GiB) copied, 29,3667 s, 143 MB/s
#

# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-91/dd_test bs=4M count=1000
oflag=direct
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB, 3,9 GiB) copied, 33,4732 s, 125 MB/s
#
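Note that dd with bs=4M and oflag=direct only measures sequential
throughput. To get a rough idea of small synchronous write latency on the
same filesystems (closer to what the journal/apply path sees), something
like this can be used (file name is just an example):

# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-53/dd_test_4k bs=4k count=10000 oflag=direct,dsync

A markedly lower figure on one OSD than on the others would point at that
disk or its controller settings rather than at Ceph.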

Device Model: TOSHIBA MG04ACA400N
Firmware Version: FJ2D
User Capacity: 4,000,787,030,016 bytes [4,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
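If a faulty drive is suspected, it may also be worth checking the SMART
error counters (a sketch; the device name is an example, and since the
disks sit behind the PERC you may need -d megaraid,N to reach them):

# smartctl -A /dev/sdb | egrep 'Reallocated|Pending|Uncorrect'
# smartctl -l error /dev/sdb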


Please see some lines below with the number of files per directory in the
affected filestore.
I think the directories with 300+ files look like they have just been
merged, or am I getting it wrong?

...
337 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_E
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_0
18 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_1
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_6
25 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_2
18 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_3
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_4
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_5
27 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_7
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_8
25 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_9
19 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_A
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_B
28 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_C
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_D
19 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_E
28 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_F
338 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_0
344 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_1
320 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_2
50 in /var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_0
26 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_1
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_2
13 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_3
28 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_4
19 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_5
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_6
17 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_7
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_8
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_9
23 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_B
23 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_E
26 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_D
17 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_F
325 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_4
...
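In case it helps to reproduce the listing above, the per-directory file
counts can be generated with something like this (a sketch; adjust the
path to the PG directory being inspected):

# find /var/lib/ceph/osd/ceph-91/current/20.83s4_head -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 -type f | wc -l) in $1"' _ {} \; | sort -rn | head -40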

Does ceph-objectstore-tool merge/split the filestore structure? I can try
it tomorrow.


Many thanks
Jakub
Frédéric Nass
2018-02-06 16:36:17 UTC
Permalink
Raw Message
Hi Jakub,
Post by Jakub Jaszewski
Hi Frederic,
I've not enabled debug-level logging on all OSDs, just on one for the
test; I need to double-check that.
But it looks like merging is ongoing on a few OSDs, or the OSDs are
faulty; I will dig into that tomorrow.
Write bandwidth is very random.
I just reread the whole thread:

- Splitting is not happening anymore (if it ever did), that's for sure.
- Regarding the write bandwidth variations, it seems that these
variations only concern the EC 6+3 pools.
- As you get more than 1.2 GB/s on replicated pools with 4 MB writes, I
would think that neither the NVMe, the PERC nor the HDDs are to blame.

Did you check the CPU load during EC 6+3 writes on the pool
default.rgw.buckets.data?

If you don't see 100% CPU load, or 100% disk utilization in iostat on
either the NVMe device or the HDDs, then I would benchmark the network
for bandwidth or latency issues.
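For instance (a sketch, assuming sysstat and iperf3 are available on the
nodes; adjust the target address to another node on the cluster network):

# iostat -xm 5                        (watch %util on the NVMe and HDDs during the bench)
# iperf3 -s                           (on one node)
# iperf3 -c <other node> -P 4 -t 30   (from another node)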

BTW, did you notice that some of your OSDs were not tagged as 'hdd'
(ceph osd df tree)?
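If some OSDs indeed lack the class, it can be set by hand, e.g. (a
sketch; adjust the OSD ids, and note that if any CRUSH rules select by
device class this may trigger data movement):

# ceph osd crush rm-device-class osd.0 osd.2
# ceph osd crush set-device-class hdd osd.0 osd.2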
Post by Jakub Jaszewski
# rados bench -p default.rgw.buckets.data 120 write
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects of size
4194432 for up to 120 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_59104
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
0       0         0         0         0         0  -           0
1      16       155       139    555.93   556.017  0.0750027     0.10687
2      16       264       248   495.936   436.013 0.154185    0.118693
3      16       330       314   418.616   264.008 0.118476    0.142667
4      16       415       399   398.953    340.01  0.0873379     0.15102
5      16       483       467   373.557   272.008 0.750453    0.159819
6      16       532       516   343.962   196.006  0.0298334    0.171218
7      16       617       601   343.391    340.01 0.192698    0.177288
8      16       700       684   341.963    332.01  0.0281355    0.171277
9      16       762       746   331.521   248.008  0.0962037    0.163734
 10      16       804       788   315.167   168.005  1.40356    0.196298
 11      16       897       881    320.33   372.011  0.0369085     0.19496
 12      16       985       969   322.966   352.011  0.0290563    0.193986
 13      15      1106      1091   335.657   488.015  0.0617642    0.188703
 14      16      1166      1150   328.537   236.007  0.0401884    0.186206
 15      16      1251      1235   329.299    340.01 0.171256    0.190974
 16      16      1339      1323   330.716   352.011 0.024222    0.189901
 17      16      1417      1401   329.613    312.01  0.0289473    0.186562
 18      16      1465      1449   321.967   192.006 0.028123    0.189153
 19      16      1522      1506    317.02   228.007 0.265448    0.188288
2018-02-06 13:43:21.412512 min lat: 0.0204657 max lat: 3.61509 avg
lat: 0.18918
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
 20      16      1564      1548   309.568   168.005  0.0327581     0.18918
 21      16      1636      1620    308.54   288.009  0.0715159    0.187381
 22      16      1673      1657   301.242   148.005  1.57285    0.191596
 23      16      1762      1746   303.621   356.011  6.00352    0.206217
 24      16      1885      1869   311.468   492.015  0.0298435    0.203874
 25      16      2010      1994   319.008   500.015  0.0258761    0.199652
 26      16      2116      2100   323.044   424.013  0.0533319     0.19631
 27      16      2201      2185    323.67    340.01 0.134796    0.195953
 28      16      2257      2241    320.11   224.007 0.473629    0.196464
 29      16      2333      2317   319.554   304.009  0.0362741    0.198054
 30      16      2371      2355   313.968   152.005 0.438141    0.200265
 31      16      2459      2443   315.194   352.011  0.0610629    0.200858
 32      16      2525      2509   313.593   264.008  0.0234799    0.201008
 33      16      2612      2596   314.635   348.011 0.072019    0.199094
 34      16      2682      2666   313.615   280.009  0.10062    0.197586
 35      16      2757      2741   313.225   300.009  0.0552581    0.196981
 36      16      2849      2833   314.746   368.011 0.257323     0.19565
 37      16      2891      2875   310.779   168.005  0.0918386     0.19556
 38      16      2946      2930    308.39   220.007  0.0276621    0.195792
 39      16      2975      2959   303.456   116.004  0.0588971     0.19952
0.198749
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
 40      16      3060      3044   304.369    340.01  0.0217136    0.198749
 41      16      3098      3082   300.652   152.005  0.0717398    0.199052
 42      16      3141      3125   297.589   172.005  0.0257422    0.201899
 43      15      3241      3226   300.063   404.012  0.0733869    0.209446
 44      16      3332      3316   301.424   360.011  0.0327249    0.206686
 45      16      3430      3414   303.436   392.012  0.0413156    0.203727
 46      16      3534      3518   305.882   416.013 0.033638    0.202182
 47      16      3602      3586   305.161   272.008  0.0453557    0.200028
 48      16      3663      3647   303.886   244.007  0.0779019    0.199777
 49      16      3736      3720   303.643   292.009  0.0285231    0.206274
 50      16      3849      3833   306.609   452.014  0.0537071    0.208127
 51      16      3909      3893   305.303   240.007  0.0366709    0.207793
 52      16      3972      3956   304.277   252.008  0.0289131    0.207989
 53      16      4048      4032   304.272   304.009  0.0348617    0.207844
 54      16      4114      4098   303.525   264.008  0.0799526     0.20701
 55      16      4191      4175   303.606   308.009 0.034915    0.206882
 56      16      4313      4297   306.898   488.015 0.108777    0.205869
 57      16      4449      4433   311.057   544.017  0.0862092    0.205232
 58      16      4548      4532    312.52   396.012 0.135814    0.203753
 59      16      4678      4662   316.036   520.016  0.0307446    0.202156
2018-02-06 13:44:01.417687 min lat: 0.0204657 max lat: 13.0888 avg
lat: 0.201848
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
 60      16      4734      4718   314.502   224.007 0.223548    0.201848
 61      16      4783      4767   312.559   196.006 0.124237    0.203641
 62      16      4838      4822   311.066   220.007 0.270943    0.204074
 63      16      4929      4913   311.905   364.011 0.123486    0.204716
 64      16      5063      5047   315.406   536.016 0.101675    0.202283
 65      16      5124      5108   314.307   244.007  0.0504368    0.201181
 66      16      5243      5227   316.756   476.015 0.235348    0.201481
 67      16      5339      5323   317.759   384.012  0.0478478    0.200486
 68      16      5414      5398   317.497   300.009 0.258401    0.200194
 69      16      5461      5445    315.62   188.006 0.112516     0.20022
 70      16      5523      5507   314.654   248.008  0.0927405    0.201534
 71      16      5582      5566   313.546   236.007 0.333586    0.202304
 72      16      5690      5674    315.19   432.013  0.0706812    0.201498
 73      16      5780      5764   315.804   360.011  0.0306772    0.200354
 74      16      5850      5834    315.32   280.009  0.0261027    0.200627
 75      16      5905      5889   314.048   220.007 0.101282     0.20075
 76      16      6013      5997     315.6   432.013 0.161956    0.202154
 77      16      6130      6114   317.578   468.014 0.042322     0.20092
 78      16      6232      6216   318.737   408.012  0.0912166    0.199238
 79      16      6260      6244    316.12   112.003  0.0406156    0.198971
2018-02-06 13:44:21.420343 min lat: 0.0204657 max lat: 13.0888 avg
lat: 0.200443
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
 80      16      6304      6288   314.368   176.005  0.0409417    0.200443
 81      16      6375      6359   313.993   284.009 0.150617     0.19986
 82      16      6447      6431   313.676   288.009  0.0730616    0.200072
 83      16      6513      6497   313.077   264.008  0.0565517    0.202155
 84      16      6594      6578   313.206    324.01  0.0273074    0.202981
 85      16      6732      6716   316.015   552.017 0.134015    0.202301
 86      16      6878      6862   319.131   584.018 0.157684    0.200182
 87      16      6971      6955   319.738   372.011 0.327923    0.199431
 88      16      7018      7002   318.241   188.006  0.0317034    0.198737
 89      16      7056      7040   316.373   152.005  0.0874119    0.199579
 90      16      7095      7079    314.59   156.005 0.027061    0.198883
 91      16      7150      7134   313.551   220.007 0.247791    0.199385
 92      16      7201      7185    312.36   204.006 0.131592    0.202395
 93      16      7216      7200   309.646   60.0018  0.0678182    0.202927
 94      16      7276      7260   308.905   240.007 0.040735    0.204362
 95      16      7304      7288   306.832   112.003  0.04629    0.204849
 96      16      7402      7386   307.719   392.012 0.085536    0.206601
 97      16      7524      7508   309.577   488.015 0.904426    0.205316
 98      16      7682      7666   312.866   632.019  0.0580214    0.204241
 99      16      7840      7824   316.089   632.019  0.13412    0.202289
2018-02-06 13:44:41.423052 min lat: 0.0204657 max lat: 13.0888 avg
lat: 0.200948
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
100      16      7949      7933   317.288   436.013 0.025568    0.200948
101      16      7996      7980   316.007   188.006 0.328559    0.201496
102      16      8020      8004    313.85   96.0029  0.0524293     0.20214
103      16      8044      8028   311.735   96.0029  0.0628428     0.20236
104      16      8062      8046    309.43   72.0022  0.0320078    0.202743
105      16      8088      8072   307.473   104.003  0.0506497    0.204222
106      16      8127      8111   306.044   156.005  0.0436112     0.20792
107      16      8226      8210   306.885   396.012  0.0452614    0.207666
108      16      8296      8280   306.635   280.009  0.0500199    0.207578
109      16      8367      8351   306.428   284.009  0.0364288    0.207779
110      16      8475      8459   307.569   432.013 0.140141    0.206752
111      16      8559      8543   307.825    336.01  0.0250007    0.206032
112      16      8696      8680   309.968   548.017  0.0249451    0.205808
113      16      8804      8788   311.048   432.013  0.49485    0.205075
114      16      8917      8901   312.284   452.014  0.0731665    0.204527
115      16      9047      9031    314.09   520.016  0.0365535    0.203553
116      16      9130      9114   314.244    332.01  0.0451302    0.203367
117      16      9259      9243   315.968   516.016  0.0610521    0.202253
118      16      9373      9357   317.154   456.014  0.0282051    0.201613
119      16      9465      9449   317.581   368.011  0.0299575    0.200845
2018-02-06 13:45:01.425689 min lat: 0.0204657 max lat: 13.0888 avg
lat: 0.200298
sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
lat(s)
120      16      9545      9529   317.601    320.01  0.0264095    0.200298
121       7      9546      9539   315.307   40.0012 0.659852    0.201486
122       7      9546      9539   312.722         0  -    0.201486
Total time run:         122.442935
Total writes made:      9546
Write size:             4194432
Object size:            4194432
Bandwidth (MB/sec):     311.861
Stddev Bandwidth:       133.889
Max bandwidth (MB/sec): 632.019
Min bandwidth (MB/sec): 0
Average IOPS:           77
Stddev IOPS:            33
Max IOPS:               158
Min IOPS:               0
Average Latency(s):     0.203467
Stddev Latency(s):      0.531046
Max latency(s):         13.0888
Min latency(s):         0.0204657
Cleaning up (deleting benchmark objects)
Removed 9546 objects
Clean up completed and total clean up time :308.444893
#
Many slow requests occurred in the log once the cleanup started. The
affected OSDs, osd.53 and osd.91, are located on different nodes.
Slow requests during cleanup are surely due to merging.
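To see where those requests spend their time, the slowest recent ops can
be dumped from the admin socket of the affected OSDs (run on the node
hosting the OSD):

# ceph daemon osd.53 dump_historic_ops
# ceph daemon osd.91 dump_ops_in_flight

The per-event timestamps in the output usually make it clear whether the
time is lost waiting for the journal, for the PG lock (queued_for_pg) or
for the replicas (sub_op_commit_rec).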
Post by Jakub Jaszewski
2018-02-06 13:42:49.881647 mon.sg01-06 mon.0 10.212.32.18:6789/0 3574 :
cluster [WRN] overall HEALTH_WARN noscrub,nodeep-scrub flag(s) set
2018-02-06 13:44:49.777466 mon.sg01-06 mon.0 10.212.32.18:6789/0 3577 :
cluster [INF] mon.1 10.212.32.19:6789/0
2018-02-06 13:44:49.777627 mon.sg01-06 mon.0 10.212.32.18:6789/0 3578 :
cluster [INF] mon.2 10.212.32.20:6789/0
2018-02-06 13:45:22.221472 mon.sg01-06 mon.0 10.212.32.18:6789/0
4 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:45:26.227333 mon.sg01-06 mon.0 10.212.32.18:6789/0 3580 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 4 slow requests
are blocked > 32 sec)
2018-02-06 13:45:40.325885 mon.sg01-06 mon.0 10.212.32.18:6789/0
26 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:45:46.360383 mon.sg01-06 mon.0 10.212.32.18:6789/0 3583 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 26 slow requests
are blocked > 32 sec)
2018-02-06 13:46:08.518861 mon.sg01-06 mon.0 10.212.32.18:6789/0
1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:13.806928 mon.sg01-06 mon.0 10.212.32.18:6789/0
9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:18.807371 mon.sg01-06 mon.0 10.212.32.18:6789/0
20 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:23.807835 mon.sg01-06 mon.0 10.212.32.18:6789/0
15 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:19.082992 osd.53 osd.53 10.212.32.22:6802/50082 167 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.071323 secs
2018-02-06 13:46:19.082999 osd.53 osd.53 10.212.32.22:6802/50082 168 :
cluster [WRN] slow request osd_op(client.10075.0:25660466 3.50964d18
3:18b2690a:::rbd_data.bcc718ab7e41.000000000001018d:head [write
2703360~8192] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently journaled_completion_queued
2018-02-06 13:46:21.083367 osd.53 osd.53 10.212.32.22:6802/50082 169 :
cluster [WRN] 2 slow requests, 1 included below; oldest blocked for >
32.071703 secs
2018-02-06 13:46:21.083390 osd.53 osd.53 10.212.32.22:6802/50082 170 :
cluster [WRN] slow request osd_op(client.841436.0:77497677 3.52e3e8ff
3:ff17c74a:::rbd_data.cd25332916c98.0000000000012b0b:head [write
245760~3612672] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently commit_sent
2018-02-06 13:46:28.808214 mon.sg01-06 mon.0 10.212.32.18:6789/0
16 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:33.808664 mon.sg01-06 mon.0 10.212.32.18:6789/0
17 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:38.809093 mon.sg01-06 mon.0 10.212.32.18:6789/0
1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:43.809537 mon.sg01-06 mon.0 10.212.32.18:6789/0
2 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:44.760767 mon.sg01-06 mon.0 10.212.32.18:6789/0 3593 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests
are blocked > 32 sec)
2018-02-06 13:46:52.819718 mon.sg01-06 mon.0 10.212.32.18:6789/0
70 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:46:59.674390 mon.sg01-06 mon.0 10.212.32.18:6789/0
105 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-02-06 13:47:01.444356 osd.91 osd.91 10.212.32.25:6818/81813 5 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.028512 secs
2018-02-06 13:47:01.444365 osd.91 osd.91 10.212.32.25:6818/81813 6 :
cluster [WRN] slow request osd_op(client.757449.0:155792306 3.92bf145a
3:5a28fd49:::rbd_data.8e125f170c6e.00000000000318b3:head [write
176128~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently sub_op_commit_rec from 72
2018-02-06 13:47:02.444612 osd.91 osd.91 10.212.32.25:6818/81813 7 :
cluster [WRN] 12 slow requests, 5 included below; oldest blocked for >
31.028754 secs
2018-02-06 13:47:02.444637 osd.91 osd.91 10.212.32.25:6818/81813 12 :
cluster [WRN] slow request osd_op(client.757449.0:155792349 3.92bf145a
3:5a28fd49:::rbd_data.8e125f170c6e.00000000000318b3:head [write
217088~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently sub_op_commit_rec from 72
2018-02-06 13:47:01.089688 osd.53 osd.53 10.212.32.22:6802/50082 171 :
cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.546028 secs
2018-02-06 13:47:01.089695 osd.53 osd.53 10.212.32.22:6802/50082 172 :
cluster [WRN] slow request osd_op(client.48366.0:29692284 3.859d4e2b
3.859d4e2b (undecoded) ack+ondisk+write+known_if_redirected e5954)
currently queued_for_pg
2018-02-06 13:47:02.089919 osd.53 osd.53 10.212.32.22:6802/50082 173 :
cluster [WRN] 9 slow requests, 5 included below; oldest blocked for >
31.546245 secs
2018-02-06 13:47:02.089926 osd.53 osd.53 10.212.32.22:6802/50082 174 :
cluster [WRN] slow request osd_repop(client.757449.0:155792317 3.5a
e5954/5842) currently queued_for_pg
2018-02-06 13:47:04.445182 osd.91 osd.91 10.212.32.25:6818/81813 24 :
cluster [WRN] slow request osd_op(client.757449.0:155792384 3.92bf145a
3:5a28fd49:::rbd_data.8e125f170c6e.00000000000318b3:head [write
262144~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e5954)
currently sub_op_commit_rec from 72
2018-02-06 13:50:11.133054 osd.53 osd.53 10.212.32.22:6802/50082 864 :
cluster [WRN] slow request osd_op(client.840950.0:169909382 3.cf2d18ff
3.cf2d18ff (undecoded) ack+ondisk+write+known_if_redirected e5954)
currently queued_for_pg
2018-02-06 13:50:11.133058 osd.53 osd.53 10.212.32.22:6802/50082 865 :
cluster [WRN] slow request osd_op(client.749595.0:78222152 3.4347d828
3.4347d828 (undecoded) ack+ondisk+write+known_if_redirected e5954)
currently queued_for_pg
2018-02-06 13:50:11.133061 osd.53 osd.53 10.212.32.22:6802/50082 866 :
cluster [WRN] slow request osd_op(client.749595.0:78223068 3.4347d828
3.4347d828 (undecoded) ack+ondisk+write+known_if_redirected e5954)
currently queued_for_pg
2018-02-06 13:50:12.133218 osd.53 osd.53 10.212.32.22:6802/50082 867 :
cluster [WRN] 48 slow requests, 1 included below; oldest blocked for >
168.672302 secs
2018-02-06 13:50:12.133223 osd.53 osd.53 10.212.32.22:6802/50082 868 :
cluster [WRN] slow request osd_op(client.48192.0:97243662 3.7163d828
3.7163d828 (undecoded) ack+ondisk+write+known_if_redirected e5954)
currently queued_for_pg
2018-02-06 13:50:18.139428 mon.sg01-06 mon.0 10.212.32.18:6789/0 3639 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 247 slow requests
are blocked > 32 sec)
2018-02-06 13:50:49.882654 mon.sg01-06 mon.0 10.212.32.18:6789/0 3641 :
cluster [WRN] overall HEALTH_WARN noscrub,nodeep-scrub flag(s) set
Testing with dd
# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-53/dd_test bs=4M
count=1000 oflag=direct
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB, 3,9 GiB) copied, 29,3667 s, 143 MB/s
#
# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-91/dd_test bs=4M
count=1000 oflag=direct
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB, 3,9 GiB) copied, 33,4732 s, 125 MB/s
#
Device Model:     TOSHIBA MG04ACA400N
Firmware Version: FJ2D
User Capacity:    4,000,787,030,016 bytes [4,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Please see some lines below with the number of files per directory in
the affected filestore.
I think the directories with 300+ files look like they have just been
merged, or am I getting it wrong?
That's probably right, but now that there's a random split factor, you
may have had more than 320 files per subdirectory with the previous
default split and merge values (with filestore_split_multiple = 2 and
filestore_merge_threshold = 10, splitting starts above 2 x 10 x 16 = 320
files per subdirectory).

But yeah, the blocked requests surely come from merging operations.
Post by Jakub Jaszewski
...
337 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_E
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_0
18 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_1
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_6
25 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_2
18 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_3
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_4
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_5
27 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_7
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_8
25 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_9
19 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_A
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_B
28 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_C
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_D
19 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_E
28 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_4/DIR_F/DIR_F
338 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_0
344 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_1
320 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_2
50 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_0
26 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_1
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_2
13 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_3
28 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_4
19 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_5
20 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_6
17 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_7
21 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_8
22 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_9
23 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_B
23 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_E
26 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_D
17 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_3/DIR_F
325 in
/var/lib/ceph/osd/ceph-91/current/20.83s4_head/DIR_3/DIR_8/DIR_8/DIR_4
...
Does ceph-objectstore-tool merge/split the filestore structure? I can
try it tomorrow.
It only does the splitting part. You might want to set
filestore_merge_threshold to -40 so that merging never happens, OR
provoke merging during off-peak hours with something like this:

while true ; do rados bench -p default.rgw.buckets.data 10 write -b 4K
-t 16 ; sleep 60 ; done

Every of the 1024 PGs should rapidly see an object write and delete and
then be merged.
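If you go for disabling merges, a minimal sketch (the option goes under
[osd] in ceph.conf; it can also be injected at runtime, though it may
only take full effect after the OSDs are restarted):

[osd]
filestore merge threshold = -40

# ceph tell osd.* injectargs '--filestore_merge_threshold -40'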

Regards,

Frédéric.
Post by Jakub Jaszewski
Many thanks
Jakub