Discussion:
Ceph GET latency
Guang Yang
2014-02-18 15:24:34 UTC
Hi ceph-users,
We are using Ceph (radosgw) to store user-generated images. As GET latency is critical for us, I recently did some investigation over the GET path to understand where the time is spent.

I first confirmed that the latency comes from the OSD (read op), so we instrumented the code to trace the GET request (the read op on the OSD side; to be more specific, each object of size [512K + 4M * x] is split into [1 + x] chunks, and each chunk needs one read op). Each read op goes through the following steps:
    1. Dispatch and pick-up by an op thread for processing (processing not yet started).
             0 - 20 ms,   94%
             20 - 50 ms,   2%
             50 - 100 ms,  2%
             100 ms+,      2%
         For those with 20 ms+ latency, half are due to waiting for the pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089); the other half is yet to be investigated.

    2. Get file xattr ('-'), which opens the file and populates the fd cache (https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
             0 - 20 ms,   80%
             20 - 50 ms,   8%
             50 - 100 ms,  7%
             100 ms+,      5%
         The latency comes from (in decreasing order): file path lookup (https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file open, and fd cache lookup/add.
         Currently objects are stored in level-6 or level-7 folders (due to http://tracker.ceph.com/issues/7207, I stopped folder splitting).

    3. Get more xattrs; this is fast thanks to the fd cache populated in step 2 (rarely > 1 ms).

    4. Read the data.
             0 - 20 ms,   84%
             20 - 50 ms,  10%
             50 - 100 ms,  4%
             100 ms+,      2%

I decreased vfs_cache_pressure from its default value of 100 to 5 to make the VFS favor the dentry/inode cache over the page cache; unfortunately it did not help.
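For reference, this is roughly the kind of change I made (a sketch; the persistent line in /etc/sysctl.conf is an assumption about the setup):

    # check the current value
    sysctl vm.vfs_cache_pressure
    # lower it at runtime so the kernel prefers keeping dentries/inodes cached
    sysctl -w vm.vfs_cache_pressure=5
    # persist across reboots
    echo 'vm.vfs_cache_pressure = 5' >> /etc/sysctl.conf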

Long story short, most of the long-latency read ops come from file system calls (for cold data); since our workload mainly stores objects smaller than 500KB, it generates a very large number of small files.

I would like to ask whether people have experienced similar issues, and whether there is any suggestion I can try to boost GET performance. On the other hand, PUT latency could be sacrificed.

Thanks,
Guang
Yehuda Sadeh
2014-02-19 19:09:22 UTC
Post by Guang Yang
Hi ceph-users,
We are using Ceph (radosgw) to store user generated images, as GET latency
is critical for us, most recently I did some investigation over the GET path
to understand where time spend.
I first confirmed that the latency came from OSD (read op), so that we
instrumented code to trace the GET request (read op at OSD side, to be more
specific, each object with size [512K + 4M * x] are splitted into [1 + x]
chunks, each chunk needs one read op ), for each read op, it needs to go
1. Dispatch and take by a op thread to process (process not started).
0 - 20 ms, 94%
20 - 50 ms, 2%
50 - 100 ms, 2%
100ms+ , 2%
For those having 20ms+ latency, half of them are due to waiting for
pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089),
another half are yet to be investigated.
2. Get file xattr ('-'), which open the file and populate fd cache
(https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
0 - 20 ms, 80%
20 - 50 ms, 8%
50 - 100 ms, 7%
100ms+ , 5%
The latency either comes from (from more to less): file path lookup
(https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
open, or fd cache lookup /add.
Currently objects are store in level 6 or level 7 folder (due to
http://tracker.ceph.com/issues/7207, I stopped folder splitting).
3. Get more xattrs, this is fast due to previous fd cache (rarely >
1ms).
4. Read the data.
0 - 20 ms, 84%
20 - 50 ms, 10%
50 - 100 ms, 4%
100ms+ , 2%
I decreased vfs_cache_pressure from its default value 100 to 5 to make VFS
favor dentry/inode cache over page cache, unfortunately it does not help.
Long story short, most of the long latency read op comes from file system
call (for cold data), as our workload mainly stores objects less than 500KB,
so that it generates a large bunch of objects.
I would like to ask if people experienced similar issue and if there is any
suggestion I can try to boost the GET performance. On the other hand, PUT
could be sacrificed.
Try looking at the perfcounters and see if there's any other throttling
happening. Also, make sure you have enough PGs for your data pool. One
other thing to try is disabling leveldb xattrs to see if it affects
your latency.
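Roughly the kind of checks I mean (osd id and pool name are placeholders; the xattr option name is from memory):

    # dump the OSD perf counters via the admin socket and look for throttle waits
    ceph daemon osd.0 perf dump
    # number of PGs in the rgw data pool
    ceph osd pool get .rgw.buckets pg_num
    # whether FileStore is keeping xattrs in leveldb (omap)
    ceph daemon osd.0 config get filestore_xattr_use_omap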

Yehuda
GuangYang
2014-02-20 06:27:52 UTC
Thanks Yehuda.
Post by Yehuda Sadeh
Try looking at the perfcounters, see if there's any other throttling
happening. Also, make sure you have enough pgs for your data pool. One
other thing to try is disabling leveldb xattrs and see if it affects
your latency.
1. There is no throttling happening.
2. The Ceph recommendation is around 100 PGs per OSD, and we bumped that to 200 (rough math below).
3. Can you elaborate on the xattr suggestion? How could that affect GET latency?
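(Rough sketch of the per-OSD PG math; pool name is a placeholder:)

    # PGs per OSD ~= pg_num * replica size / number of OSDs
    ceph osd pool get .rgw.buckets pg_num
    ceph osd pool get .rgw.buckets size
    ceph osd stat    # reports the total OSD count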

Thanks,
Guang
Post by Yehuda Sadeh
Post by Guang Yang
Hi ceph-users,
We are using Ceph (radosgw) to store user generated images, as GET latency
is critical for us, most recently I did some investigation over the GET path
to understand where time spend.
I first confirmed that the latency came from OSD (read op), so that we
instrumented code to trace the GET request (read op at OSD side, to be more
specific, each object with size [512K + 4M * x] are splitted into [1 + x]
chunks, each chunk needs one read op ), for each read op, it needs to go
1. Dispatch and take by a op thread to process (process not started).
0 - 20 ms, 94%
20 - 50 ms, 2%
50 - 100 ms, 2%
100ms+ , 2%
For those having 20ms+ latency, half of them are due to waiting for
pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089),
another half are yet to be investigated.
2. Get file xattr ('-'), which open the file and populate fd cache
(https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
0 - 20 ms, 80%
20 - 50 ms, 8%
50 - 100 ms, 7%
100ms+ , 5%
The latency either comes from (from more to less): file path lookup
(https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
open, or fd cache lookup /add.
Currently objects are store in level 6 or level 7 folder (due to
http://tracker.ceph.com/issues/7207, I stopped folder splitting).
3. Get more xattrs, this is fast due to previous fd cache (rarely >
1ms).
4. Read the data.
0 - 20 ms, 84%
20 - 50 ms, 10%
50 - 100 ms, 4%
100ms+ , 2%
I decreased vfs_cache_pressure from its default value 100 to 5 to make VFS
favor dentry/inode cache over page cache, unfortunately it does not help.
Long story short, most of the long latency read op comes from file system
call (for cold data), as our workload mainly stores objects less than 500KB,
so that it generates a large bunch of objects.
I would like to ask if people experienced similar issue and if there is any
suggestion I can try to boost the GET performance. On the other hand, PUT
could be sacrificed.
Try looking at the perfcounters, see if there's any other throttling
happening. Also, make sure you have enough pgs for your data pool. One
other thing to try is disabling leveldb xattrs and see if it affects
your latency.
Yehuda
Gregory Farnum
2014-02-20 15:16:29 UTC
Post by Guang Yang
Hi ceph-users,
We are using Ceph (radosgw) to store user generated images, as GET latency
is critical for us, most recently I did some investigation over the GET path
to understand where time spend.
I first confirmed that the latency came from OSD (read op), so that we
instrumented code to trace the GET request (read op at OSD side, to be more
specific,
How'd you instrument it? Are you aware of the OpTracker system that
records and can output important events?
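For example (osd id is a placeholder):

    # recent slow ops, each with a timestamped event trail
    ceph daemon osd.0 dump_historic_ops
    # ops currently being processed
    ceph daemon osd.0 dump_ops_in_flight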
Post by Guang Yang
each object with size [512K + 4M * x] are splitted into [1 + x]
chunks, each chunk needs one read op ), for each read op, it needs to go
1. Dispatch and take by a op thread to process (process not started).
0 - 20 ms, 94%
20 - 50 ms, 2%
50 - 100 ms, 2%
100ms+ , 2%
For those having 20ms+ latency, half of them are due to waiting for
pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089),
another half are yet to be investigated.
The PG lock conflict means that there's something else happening in
the PG at the same time; that's a logical contention issue. However,
20ms is a long time, so if you can figure out what else the PG is
doing during that time it'd be interesting.
Post by Guang Yang
2. Get file xattr ('-'), which open the file and populate fd cache
(https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
0 - 20 ms, 80%
20 - 50 ms, 8%
50 - 100 ms, 7%
100ms+ , 5%
The latency either comes from (from more to less): file path lookup
(https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
open, or fd cache lookup /add.
Currently objects are store in level 6 or level 7 folder (due to
http://tracker.ceph.com/issues/7207, I stopped folder splitting).
FYI there's been some community and Inktank work to try and speed this
up recently. None of it's been merged into master yet, but we'll
definitely have some improvements to this post-Firefly.
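In the meantime, the FileStore knobs most relevant to this path are roughly the following (a sketch; osd id is a placeholder, check the values on your build):

    # how aggressively collection directories are split/merged
    ceph daemon osd.0 config get filestore_merge_threshold
    ceph daemon osd.0 config get filestore_split_multiple
    # how many open file handles FileStore keeps cached
    ceph daemon osd.0 config get filestore_fd_cache_size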
-Greg
Post by Guang Yang
3. Get more xattrs, this is fast due to previous fd cache (rarely >
1ms).
4. Read the data.
0 - 20 ms, 84%
20 - 50 ms, 10%
50 - 100 ms, 4%
100ms+ , 2%
I decreased vfs_cache_pressure from its default value 100 to 5 to make VFS
favor dentry/inode cache over page cache, unfortunately it does not help.
Long story short, most of the long latency read op comes from file system
call (for cold data), as our workload mainly stores objects less than 500KB,
so that it generates a large bunch of objects.
I would like to ask if people experienced similar issue and if there is any
suggestion I can try to boost the GET performance. On the other hand, PUT
could be sacrificed.
Thanks,
Guang
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
GuangYang
2014-02-21 14:35:21 UTC
Thanks Greg for the response, my comments inline.

Thanks,
Guang
Post by Gregory Farnum
Post by Guang Yang
Hi ceph-users,
We are using Ceph (radosgw) to store user generated images, as GET latency
is critical for us, most recently I did some investigation over the GET path
to understand where time spend.
I first confirmed that the latency came from OSD (read op), so that we
instrumented code to trace the GET request (read op at OSD side, to be more
specific,
How'd you instrument it? Are you aware of the OpTracker system that
records and can output important events?
I just added some time measurement code before and after some calls; here is the diff (I ported the lock contention fix to dumpling for my test, please ignore that part) - https://github.com/ceph/ceph/pull/1281/files
Yes, we are using OpTracker as well (dump historic ops); however, from the dumped history it is a little hard to tell where the time is spent (it falls between 'started' and 'finish'). As read latency is critical for us, we wanted the next level of detail, so we added some checks around the pg lock / file system calls.

I am also thinking of refining the code with more logging to reflect where time is spent on the FileStore side; would you be interested in that type of code change? (It took me some time to troubleshoot the GET latency, and in the end I needed to add more logging to get a picture.)
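(Without patching, a related thing one can do is temporarily raise the debug levels while reproducing a slow GET; a sketch, osd id is a placeholder:)

    # turn up FileStore/OSD logging temporarily
    ceph tell osd.0 injectargs '--debug-filestore 10 --debug-osd 10'
    # and back down afterwards
    ceph tell osd.0 injectargs '--debug-filestore 1 --debug-osd 1'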
Post by Gregory Farnum
Post by Guang Yang
each object with size [512K + 4M * x] are splitted into [1 + x]
chunks, each chunk needs one read op ), for each read op, it needs to go
1. Dispatch and take by a op thread to process (process not started).
0 - 20 ms, 94%
20 - 50 ms, 2%
50 - 100 ms, 2%
100ms+ , 2%
For those having 20ms+ latency, half of them are due to waiting for
pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089),
another half are yet to be investigated.
The PG lock conflict means that there's something else happening in
the PG at the same time; that's a logical contention issue. However,
20ms is a long time, so if you can figure out what else the PG is
doing during that time it'd be interesting
I haven't deep-dived into that yet (it is a relatively small percentage compared with the second part, which comes from FileStore I/O slowness due to dentry/inode cache misses). My interpretation is that since we serialize operations per PG, and some ops take a long time (e.g. for the reason below), an op trying to operate on the same PG may have to wait for a while.
BTW, what is the long-term strategy for Ceph to support small files (the problem that motivated Facebook to build Haystack)?
Post by Gregory Farnum
Post by Guang Yang
2. Get file xattr ('-'), which open the file and populate fd cache
(https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230).
0 - 20 ms, 80%
20 - 50 ms, 8%
50 - 100 ms, 7%
100ms+ , 5%
The latency either comes from (from more to less): file path lookup
(https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file
open, or fd cache lookup /add.
Currently objects are store in level 6 or level 7 folder (due to
http://tracker.ceph.com/issues/7207, I stopped folder splitting).
FYI there's been some community and Inktank work to try and speed this
up recently. None of it's been merged into master yet, but we'll
definitely have some improvements to this post-Firefly.
Yes, I am aware of that and we are extremely interested in the effort as well. Basically our use case is:
1. GET latency is more critical than PUT latency.
2. Our workload is mainly small files (95% are less than 512KB).
Post by Gregory Farnum
-Greg
Post by Guang Yang
3. Get more xattrs, this is fast due to previous fd cache (rarely >
1ms).
4. Read the data.
0 - 20 ms, 84%
20 - 50 ms, 10%
50 - 100 ms, 4%
100ms+ , 2%
I decreased vfs_cache_pressure from its default value 100 to 5 to make VFS
favor dentry/inode cache over page cache, unfortunately it does not help.
Long story short, most of the long latency read op comes from file system
call (for cold data), as our workload mainly stores objects less than 500KB,
so that it generates a large bunch of objects.
I would like to ask if people experienced similar issue and if there is any
suggestion I can try to boost the GET performance. On the other hand, PUT
could be sacrificed.
Thanks,
Guang
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com