[ceph-users] How to get RBD volume to PG mapping?
Межов Игорь Александрович
2015-09-25 13:07:42 UTC
Hi!

Last week I wrote that one PG in our Firefly cluster is stuck in a degraded state with 2 replicas instead of 3
and does not try to backfill or recover. We are trying to find out which RBD volumes are affected.

Our working plan is inspired by Sébastien Han's snippet
(http://www.sebastien-han.fr/blog/2013/11/19/ceph-rbd-objects-placement/)
and consists of the following steps:

1. 'rbd -p <pool> ls' - list all RBD volumes in the pool
2. Get the RBD prefix corresponding to each volume
3. Get the list of objects belonging to the RBD volume
4. Run 'ceph osd map <pool> <objectname>' to get the PG and OSD placement for each object
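
A rough sketch of steps 2-4 as shell commands (the pool name 'rbd' and volume name 'vol1' are
placeholders - our real names differ):

    # step 2: get the block name prefix, e.g. rbd_data.1a2b3c4d5e6f
    prefix=$(rbd -p rbd info vol1 | awk '/block_name_prefix/ {print $2}')
    # step 3: list all objects belonging to this volume
    rados -p rbd ls | grep "^$prefix" > /tmp/vol1-objects.txt
    # step 4: map every object to its PG and OSDs - one slow call per object
    while read obj; do ceph osd map rbd "$obj"; done < /tmp/vol1-objects.txt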

After writing some scripts we ran into a difficulty: each 'ceph osd map ...' call takes about
0.5 seconds to return the placement of one object, so iterating over all 15 million objects would
take forever (0.5 s x 15,000,000 is roughly 87 days).

Is there any other way to find out which PGs a given RBD volume maps to, or perhaps a much faster
way to do step 4 than calling 'ceph osd map' in a loop for every object?


Thanks!

Megov Igor
CIO, Yuterra
Jan Schermer
2015-09-25 14:06:47 UTC
Try:

ceph osd map <pool> <rbdname>

Is that it?

Jan

Jan Schermer
2015-09-25 14:11:36 UTC
Ouch
1) I should have read it completely
2) I should have tested it :)
Sorry about that...

You could get the name prefix for each RBD from rbd info, then list all objects (run find on the osds?) and then you just need to grep the OSDs for each prefix... Should be much faster?
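
For example (just a sketch - paths assume the default filestore layout under /var/lib/ceph/osd,
and <imgid> stands for the hex id part of the block_name_prefix from 'rbd info', e.g. 1a2b3c4d5e6f
from rbd_data.1a2b3c4d5e6f; IIRC filestore escapes the '_' in on-disk file names, so matching on
the image id alone is safer than on the literal 'rbd_data.' prefix):

    # run on each OSD host
    find /var/lib/ceph/osd/ceph-*/current -name "*<imgid>*" > /tmp/rbd-objects-$(hostname).txt
    # the PG id is the path component ending in _head, e.g. .../3.7f_head/...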

Jan



David Burley
2015-09-25 14:15:47 UTC
So I had two ideas here:

1. Use find as Jan suggested. You probably can bound it by the expected
object naming and limit it to the OSDs that were impacted. This is probably
the best way.
2. Use osdmaptool against a copy of the osdmap that you pre-grab from
the cluster, a la:
https://www.hastexo.com/resources/hints-and-kinks/which-osd-stores-specific-rados-object
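
Roughly (a sketch, not tested; the object name and pool id are placeholders):

    # grab the current osdmap once
    ceph osd getmap -o /tmp/osdmap
    # then map objects offline, no cluster round-trip per call
    osdmaptool /tmp/osdmap --test-map-object rbd_data.<prefix>.0000000000000000 --pool <poolid>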

--David

--
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: ***@slashdotmedia.com
Межов Игорь Александрович
2015-09-25 14:53:41 UTC
Hi!

Thanks!

I have some remarks about the 1st method:

>You could get the name prefix for each RBD from rbd info,
Yes, I already do that in steps 1 and 2. I forgot to mention that I grab the rbd prefix from the 'rbd info' command.


>then list all objects (run find on the osds?) and then you just need to grep the OSDs for each prefix.
So you advise running find over ssh on all OSD hosts to traverse the OSD filesystems and locate the
files (objects) named with the rbd prefix? Am I right? If so, I have two thoughts: (1) it may not be
very fast either, because even when limited by the rbd prefix and pool index, find has to recursively
walk the whole OSD filesystem hierarchy; and (2) find will put additional load on the OSD drives.


The second method is more attractive and I will try it soon. Since we have the object names and can
get the crushmap in a usable form - either to inspect ourselves or indirectly through a library/API
call - resolving the object-to-PG-to-OSDs chain is a purely local computation and can be done
without remote calls (accessing OSD hosts, running find, etc).

Also, the slowness of looping over 'ceph osd map <pool> <object>' is easy to explain: for every
object we have to spawn a process, connect to the cluster (with auth), receive the maps on the
client, calculate the placement, and then throw it all away when the process exits. I think this
per-call overhead is the main reason for the slowness.


Megov Igor
CIO, Yuterra


David Burley
2015-09-25 15:00:55 UTC
> > then list all objects (run find on the osds?) and then you just need to grep the OSDs for each prefix.
> So you advise running find over ssh on all OSD hosts to traverse the OSD filesystems and locate the
> files (objects) named with the rbd prefix? Am I right? If so, I have two thoughts: (1) it may not be
> very fast either, because even when limited by the rbd prefix and pool index, find has to recursively
> walk the whole OSD filesystem hierarchy; and (2) find will put additional load on the OSD drives.
>
>
Given you know the PGs impacted, you would limit this find to only those
PGs in question, which from what you have described is only 1. So figure
out which OSDs are active for the PG, and run the find in the subdir for
the placement group on one of those. It should run really fast unless you
have tons of tiny objects in the PG.
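
Something along these lines (a sketch - the OSD id, PG id and image id are placeholders, and the
path assumes the default filestore layout):

    # on a host holding one of the acting OSDs of the degraded PG
    find /var/lib/ceph/osd/ceph-<osdid>/current/<pgid>_head -type f > /tmp/pg-objects.txt
    # each file name embeds the image id of the volume it belongs to; grep this list for the
    # block_name_prefix ids from 'rbd info' to see which volumes have data in that PG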

--
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: ***@slashdotmedia.com
Jan Schermer
2015-09-25 15:13:46 UTC
> On 25 Sep 2015, at 16:53, Межов Игорь Александрович <***@yuterra.ru> wrote:
>
> Hi!
>
> Thanks!
>
> I have some remarks about the 1st method:
>
> > You could get the name prefix for each RBD from rbd info,
> Yes, I already do that in steps 1 and 2. I forgot to mention that I grab the rbd prefix from the 'rbd info' command.
>
>
> > then list all objects (run find on the osds?) and then you just need to grep the OSDs for each prefix.
> So you advise running find over ssh on all OSD hosts to traverse the OSD filesystems and locate the
> files (objects) named with the rbd prefix? Am I right? If so, I have two thoughts: (1) it may not be
> very fast either, because even when limited by the rbd prefix and pool index, find has to recursively
> walk the whole OSD filesystem hierarchy; and (2) find will put additional load on the OSD drives.
>

This should be fast, the hierarchy is pretty simple and the objects fairly large.
Save it to a file and then you can grep it for anything.

But yeah, the second option is probably more robust.

Ilya Dryomov
2015-09-25 15:21:13 UTC

Internally there is a way to list objects within a specific PG
(actually more than one way IIRC), but I don't think anything like that
is exposed in a CLI (it might be exposed in librados though). Grabbing
an osdmap and iterating with osdmaptool --test-map-object over
rbd_data.<prefix>.* is probably the fastest way for you to get what you
want.
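
Something like this (untested, just a sketch; the pool name, pool id and prefix are placeholders):

    ceph osd getmap -o /tmp/osdmap
    rados -p <pool> ls | grep "^rbd_data.<prefix>" | while read obj; do
        osdmaptool /tmp/osdmap --test-map-object "$obj" --pool <poolid>
    done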

Thanks,

Ilya
Межов Игорь Александрович
2015-09-28 08:19:45 UTC
Hi!

Ilya Dryomov wrote:
>Internally there is a way to list objects within a specific PG
>(actually more than one way IIRC), but I don't think anything like that
>is exposed in a CLI (it might be exposed in librados though). Grabbing
>an osdmap and iterating with osdmaptool --test-map-object over
>rbd_data.<prefix>.* is probably the fastest way for you to get what you
>want.

Yes, I dumped the osdmap, did 'rados ls' for all objects into a file and started a simple
shell script that reads the object list and runs osdmaptool. It is surprisingly slow - it has
been running since Friday afternoon and has processed only about 5,000,000 objects out of
more than 11,000,000. So maybe I'll try to dig deeper into the librados headers and write
some homemade tool.

David Burley wrote:
>So figure out which OSDs are active for the PG, and run the find in the subdir
>for the placement group on one of those. It should run really fast unless you
>have tons of tiny objects in the PG.

I think searching the directory structure for objects is a good approach, but only on a healthy
cluster, where object placement is not changing. In my case, for some strange reason, I can't
figure out all three OSDs for this one PG. After a node crash this one PG is in a degraded state:
it has only two replicas, while the pool min_size=3.

Even stranger, I can't force it to repair: neither 'ceph pg repair' nor an OSD restart helped me
recover the PG. In 'ceph health detail' I can see only two OSDs for this PG.




Megov Igor
CIO, Yuterra


