Discussion:
[ceph-users] Recovering incomplete PGs with ceph_objectstore_tool
Chris Kitzmiller
2015-04-03 02:36:58 UTC
I have a cluster running 0.80.9 on Ubuntu 14.04. A couple nights ago I lost two disks from a pool with size=2. :(

I replaced the two failed OSDs and I now have two PGs which are marked as incomplete in an otherwise healthy cluster. Following this page ( https://ceph.com/community/incomplete-pgs-oh-my/ ) I was able to set up another node and install Giant 0.87.1, mount one of my failed OSD drives and successfully export the two PGs. I set up another OSD on my new node, weighted it to zero, and imported the two PGs.

I'm still stuck though. It seems as though the new OSD just doesn't want to share with the other OSDs. Is there any way for me to ask an OSD which PGs it has (rather than ask the MON which OSDs a PG is on) to verify that my import was good? Help!

0 and 15 were the OSDs I lost. 30 is the new OSD. I've currently got size = 2, min_size = 1.

***@storage1:~# ceph pg dump | grep incomplete | column -t
dumped all in format plain
3.102 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.529594 0'0 15730:21 [0,15] 0 [0,15] 0 13985'53107 2015-03-29 21:17:15.568125 13985'49195 2015-03-24 18:38:08.244769
3.c7 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.968841 0'0 15730:17 [15,0] 15 [15,0] 15 13985'54076 2015-03-31 19:14:22.721695 13985'54076 2015-03-31 19:14:22.721695

***@storage1:~# ceph health detail
HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests
pg 3.c7 is stuck inactive since forever, current state incomplete, last acting [15,0]
pg 3.102 is stuck inactive since forever, current state incomplete, last acting [0,15]
pg 3.c7 is stuck unclean since forever, current state incomplete, last acting [15,0]
pg 3.102 is stuck unclean since forever, current state incomplete, last acting [0,15]
pg 3.102 is incomplete, acting [0,15]
pg 3.c7 is incomplete, acting [15,0]
1 ops are blocked > 8388.61 sec
1 ops are blocked > 8388.61 sec on osd.15
1 osds have slow requests

***@storage1:~# ceph osd tree
# id weight type name up/down reweight
-1 81.65 root default
-2 81.65 host storage1
-3 13.63 journal storage1-journal1
1 2.72 osd.1 up 1
4 2.72 osd.4 up 1
2 2.73 osd.2 up 1
3 2.73 osd.3 up 1
0 2.73 osd.0 up 1
-4 13.61 journal storage1-journal2
5 2.72 osd.5 up 1
6 2.72 osd.6 up 1
8 2.72 osd.8 up 1
9 2.72 osd.9 up 1
7 2.73 osd.7 up 1
-5 13.6 journal storage1-journal3
11 2.72 osd.11 up 1
12 2.72 osd.12 up 1
13 2.72 osd.13 up 1
14 2.72 osd.14 up 1
10 2.72 osd.10 up 1
-6 13.61 journal storage1-journal4
16 2.72 osd.16 up 1
17 2.72 osd.17 up 1
18 2.72 osd.18 up 1
19 2.72 osd.19 up 1
15 2.73 osd.15 up 1
-7 13.6 journal storage1-journal5
20 2.72 osd.20 up 1
21 2.72 osd.21 up 1
22 2.72 osd.22 up 1
23 2.72 osd.23 up 1
24 2.72 osd.24 up 1
-8 13.6 journal storage1-journal6
25 2.72 osd.25 up 1
26 2.72 osd.26 up 1
27 2.72 osd.27 up 1
28 2.72 osd.28 up 1
29 2.72 osd.29 up 1
-9 0 host ithome
30 0 osd.30 up 1
LOPEZ Jean-Charles
2015-04-03 04:37:05 UTC
Hi Chris,

according to your ceph osd tree capture, although the OSD reweight is set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd crush reweight osd.30 x.y (where 1.0=1TB)
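For example, if osd.30 sits on a 3 TB drive like the others in your tree, that would be roughly:
ceph osd crush reweight osd.30 2.72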

Only when this is done will you see if it joins.

JC

> On 2 Apr 2015, at 19:36, Chris Kitzmiller <***@hampshire.edu> wrote:
>
> I have a cluster running 0.80.9 on Ubuntu 14.04. A couple nights ago I lost two disks from a pool with size=2. :(
>
> [...]
>
> -9 0 host ithome
> 30 0 osd.30 up 1
Chris Kitzmiller
2015-04-03 05:20:55 UTC
On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles <***@redhat.com> wrote:
>
> according to your ceph osd tree capture, although the OSD reweight is set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd crush reweight osd.30 x.y (where 1.0=1TB)
>
> Only when this is done will you see if it joins.

I don't really want osd.30 to join my cluster though. It is a purely temporary device that I restored just those two PGs to. It should still be able to (and be trying to) push out those two PGs with a weight of zero, right? I don't want any of my production data to migrate towards osd.30.
Craig Lewis
2015-04-06 17:49:05 UTC
In that case, I'd set the crush weight to the disk's size in TiB, and mark
the osd out:
ceph osd crush reweight osd.<OSDID> <weight>
ceph osd out <OSDID>
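For example, assuming osd.30 is on a 3 TB drive like your others:
ceph osd crush reweight osd.30 2.72
ceph osd out 30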

Then your tree should look like:
-9 *2.72* host ithome
30 *2.72* osd.30 up *0*



An OSD can be UP and OUT, which causes Ceph to migrate all of its data away.



On Thu, Apr 2, 2015 at 10:20 PM, Chris Kitzmiller <***@hampshire.edu>
wrote:

> On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles <***@redhat.com>
> wrote:
> >
> > according to your ceph osd tree capture, although the OSD reweight is
> set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign
> the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd crush
> reweight osd.30 x.y (where 1.0=1TB)
> >
> > Only when this is done will you see if it joins.
>
> I don't really want osd.30 to join my cluster though. It is a purely
> temporary device that I restored just those two PGs to. It should still be
> able to (and be trying to) push out those two PGs with a weight of zero,
> right? I don't want any of my production data to migrate towards osd.30.
Chris Kitzmiller
2015-04-07 03:43:19 UTC
On Apr 6, 2015, at 1:49 PM, Craig Lewis <***@centraldesktop.com> wrote:
> In that case, I'd set the crush weight to the disk's size in TiB, and mark the osd out:
> ceph osd crush reweight osd.<OSDID> <weight>
> ceph osd out <OSDID>
>
> Then your tree should look like:
> -9 2.72 host ithome
> 30 2.72 osd.30 up 0
>
> An OSD can be UP and OUT, which causes Ceph to migrate all of it's data away.

Thanks, Craig, but that didn’t seem to work. Some *other* PGs moved around in my cluster but not those on osd.30. Craig, both you and Jean-Charles suggested upping the weight of the temporary OSD to get the PGs there to migrate off, which I don’t really understand and which runs counter to the article I read ( https://ceph.com/community/incomplete-pgs-oh-my/ ). I’m not sure why the OSD would need any weight for its PGs to move off of it back into the cluster.

As this didn’t work I’m going to try the following unless someone tells me it’s a bad idea:

* Move osd.0 from storage1 over to Ithome
* Start osd.0 there and then stop it (to cover any upgrade process because storage1 is 0.80.9 and Ithome is 0.87.1)
* Use ceph_objectstore_tool to import PGs 3.c7 and 3.102 into osd.0 (roughly as sketched after this list)
* If that works:
** Set osd.0 out and have it flush all of its PGs back to the OSDs on storage1.
** Remove osd.0 from the cluster
** Move the drive back to storage1 and re-deploy it as a new 0.80.9 OSD
* If the ceph_objectstore_tool fails (because there’s already an empty / bad copy of those PGs)
** Attempt to remove current/3.c7_head and current/3.102_head and try again
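
A sketch of what I expect the import step to look like, assuming the usual OSD mount path once the drive is on Ithome and that the export files I extracted earlier are named like ~/3.c7.export; the fallback would remove the bad copies first with --op remove:
for i in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --op import --file ~/${i}.export ; done
for i in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --op remove --pgid $i ; done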

Just for reference again my ceph osd tree looks like:

# id weight type name up/down reweight
-1 81.65 root default
-2 81.65 host storage1
-3 13.63 journal storage1-journal1
1 2.72 osd.1 up 1
4 2.72 osd.4 up 1
2 2.73 osd.2 up 1
3 2.73 osd.3 up 1
0 2.73 osd.0 up 1
-4 13.61 journal storage1-journal2
5 2.72 osd.5 up 1
6 2.72 osd.6 up 1
8 2.72 osd.8 up 1
9 2.72 osd.9 up 1
7 2.73 osd.7 up 1
-5 13.6 journal storage1-journal3
11 2.72 osd.11 up 1
12 2.72 osd.12 up 1
13 2.72 osd.13 up 1
14 2.72 osd.14 up 1
10 2.72 osd.10 up 1
-6 13.61 journal storage1-journal4
16 2.72 osd.16 up 1
17 2.72 osd.17 up 1
18 2.72 osd.18 up 1
19 2.72 osd.19 up 1
15 2.73 osd.15 up 1
-7 13.6 journal storage1-journal5
20 2.72 osd.20 up 1
21 2.72 osd.21 up 1
22 2.72 osd.22 up 1
23 2.72 osd.23 up 1
24 2.72 osd.24 up 1
-8 13.6 journal storage1-journal6
25 2.72 osd.25 up 1
26 2.72 osd.26 up 1
27 2.72 osd.27 up 1
28 2.72 osd.28 up 1
29 2.72 osd.29 up 1
-9 0 host ithome
30 0 osd.30 up 0

The PGs I lost are currently mapped to osd.0 and osd.15. Those are the two drives that failed at the same time in my double replica cluster. PGs 3.c7 and 3.102 are apparently the only two PGs which were on both of those drives. I was able to extract the data from those PGs from one of the dead drives using ceph_objectstore_tool and that seems to have been successful. I imported those PGs using ceph_objectstore_tool to the temporary OSD on Ithome, osd.30. I just can’t seem to get them to migrate from osd.30 back to 0 and 15. I have min_size = 1 and size = 2.
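
For reference, the MON-side mapping for a PG can be double-checked with (output omitted here):
ceph pg map 3.c7
ceph pg map 3.102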

Chris Kitzmiller
2015-04-07 16:06:51 UTC
I'm not having much luck here. Is there a possibility that the imported PGs aren't being picked up because the MONs think that they're older than the empty PGs I find on the up OSDs?

I feel that I'm so close to *not* losing my RBD volume because I only have two bad PGs and I've successfully exported those PGs from my dead drive. So close!

Can someone decipher this and let me know what's up?

***@storage1:~# ceph pg dump | grep down
3.102 0 0 0 0 0 0 0 down+peering 2015-04-07 11:37:59.318222 0'0 15882:34 [17,13] 17 [17,13] 17 13985'53107 2015-03-29 21:17:15.568125 13985'49195 2015-03-24 18:38:08.244769
3.c7 3688 0 0 0 15435374592 3001 3001 down+inconsistent+peering 2015-04-07 11:37:50.498785 13985'54076 15882:276487 [15,7] 15 [15,7] 15 13985'54076 2015-03-31 19:14:22.721695 13985'54076 2015-03-31 19:14:22.721695

***@storage1:~# ceph pg 3.c7 query
http://pastebin.com/raw.php?i=QPVmLSCz


***@storage1:~# ceph pg 3.102 query
http://pastebin.com/raw.php?i=VmawW3xU


***@ithome:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 --journal /var/lib/ceph/osd/ceph-30/journal --op info --pgid 3.c7
http://pastebin.com/raw.php?i=JVwC509A


***@ithome:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 --journal /var/lib/ceph/osd/ceph-30/journal --op info --pgid 3.102
http://pastebin.com/raw.php?i=qMisJ6pn
Chris Kitzmiller
2015-04-09 18:11:51 UTC
Success! Hopefully my notes from the process will help:

In the event of multiple disk failures the cluster can lose PGs. Should this occur, it is best to attempt to restart the OSD process and have the drive marked as up+out. Marking the drive as out will cause data to flow off the drive to elsewhere in the cluster. If the ceph-osd process is unable to keep running, you can try using the ceph_objectstore_tool program to extract just the damaged PGs and import them onto a working OSD.
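
For example, to mark a still-running osd.15 out so data drains off it while it stays up (and bring it back in later if needed):
ceph osd out 15
ceph osd in 15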

Fixing Journals
In this particular scenario things were complicated by the fact that ceph_objectstore_tool came out in Giant but we were running Firefly. Since we didn't want to upgrade the cluster in a degraded state, the OSD drives had to be moved to a different physical machine for repair. This added a lot of steps related to the journals, but it wasn't a big deal. That process looks like:

On Storage1:
stop ceph-osd id=15
ceph-osd -i 15 --flush-journal
ls -l /var/lib/ceph/osd/ceph-15/journal

Note the journal device UUID then pull the disk and move it to Ithome:
rm /var/lib/ceph/osd/ceph-15/journal
ceph-osd -i 15 --mkjournal

That creates a colocated journal to use during the ceph_objectstore_tool commands. Once done:
ceph-osd -i 15 --flush-journal
rm /var/lib/ceph/osd/ceph-15/journal

Pull the disk and bring it back to Storage1. Then:
ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f /var/lib/ceph/osd/ceph-15/journal
ceph-osd -i 15 --mkjournal
start ceph-osd id=15

None of this will be needed once the cluster is running Hammer, because ceph_objectstore_tool will be available on the local machine and the journals can stay in place throughout the process.


Recovery Process
We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and OSD.15, the two disks that failed out of Storage1. The disk for OSD.0 seemed to be a total loss, while the disk for OSD.15 was somewhat more cooperative but not in any shape to stay up and running in the cluster. I took the dying OSD.15 drive and placed it into a new physical machine with a fresh install of Ceph Giant. Using Giant's ceph_objectstore_tool I was able to extract the PGs with a command like:
for i in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-15 --journal-path /var/lib/ceph/osd/ceph-15/journal --op export --pgid $i --file ~/${i}.export ; done

Once both PGs were successfully exported I attempted to import them into a new temporary OSD, following the instructions from the incomplete-pgs-oh-my article linked above. For some reason that didn't work: the OSD was up+in but wasn't backfilling the PGs into the cluster. If you find yourself in this situation I would still try that first, just in case it gives you a cleaner path.
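A useful sanity check at this point is asking the temporary OSD which PGs it actually holds (the question I had earlier in the thread). With the OSD stopped, I believe Giant's ceph_objectstore_tool supports a list-pgs op (treat the op name as something to verify; --op info on a specific PG definitely works, as shown earlier):
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 --journal-path /var/lib/ceph/osd/ceph-30/journal --op list-pgs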
Considering the above didn't work, and we were looking at the possibility of losing the RBD volume (or perhaps worse, the potential of fruitlessly fscking 35TB), I took what I might describe as heroic measures:

Running
ceph pg dump | grep incomplete

3.c7 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.968841 0'0 15730:17 [15,0] 15 [15,0] 15 13985'54076 2015-03-31 19:14:22.721695 13985'54076 2015-03-31 19:14:22.721695
3.102 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.529594 0'0 15730:21 [0,15] 0 [0,15] 0 13985'53107 2015-03-29 21:17:15.568125 13985'49195 2015-03-24 18:38:08.244769

Then I stopped all OSDs, which blocked all I/O to the cluster, with:
stop ceph-osd-all

Then I looked for all copies of the PG on all OSDs with:
for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name "$i" ; done | sort -V

/var/lib/ceph/osd/ceph-0/current/3.c7_head
/var/lib/ceph/osd/ceph-0/current/3.102_head
/var/lib/ceph/osd/ceph-3/current/3.c7_head
/var/lib/ceph/osd/ceph-13/current/3.102_head
/var/lib/ceph/osd/ceph-15/current/3.c7_head
/var/lib/ceph/osd/ceph-15/current/3.102_head

Then I flushed the journals for all of those OSDs with:
for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done

Then I removed all of those drives and moved them (using the Fixing Journals process above) to Ithome, where I used ceph_objectstore_tool to remove all traces of 3.102 and 3.c7:
for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-$i --journal-path /var/lib/ceph/osd/ceph-$i/journal --op remove --pgid $j ; done ; done

Then I imported the PGs onto OSD.0 and OSD.15 with:
for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-$i --journal-path /var/lib/ceph/osd/ceph-$i/journal --op import --file ~/${j}.export ; done ; done
for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm /var/lib/ceph/osd/ceph-$i/journal ; done

Then I moved the disks back to Storage1 and started them all back up again. I think this should have worked, but in this case OSD.0 didn't start up for some reason. I initially thought that wouldn't matter, because OSD.15 did start and so we should have had everything, but a ceph pg query of the PGs showed something like:
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [0],
"peering_blocked_by": [{
"osd": 0,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}]
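
That hint ("starting or marking this osd lost may let us proceed") pointed at the fix. For reference, removing a dead OSD is roughly the standard sequence below; this is a sketch reconstructed from memory rather than a transcript, so double-check the IDs before running it:
ceph osd out 0
ceph osd lost 0 --yes-i-really-mean-it
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0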

So I then removed OSD.0 from the cluster and everything came back to life. Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!
Paul Evans
2015-04-09 18:16:57 UTC
Congrats Chris and nice "save" on that RBD!

--
Paul

> On Apr 9, 2015, at 11:11 AM, Chris Kitzmiller <***@hampshire.edu> wrote:
>
> Success! Hopefully my notes from the process will help:
>
> [...]