Discussion:
[ceph-users] Ceph PG Incomplete = Cluster unusable
Christian Eichelmann
2014-12-29 09:56:59 UTC
Hi all,

we have a Ceph cluster with currently 360 OSDs in 11 systems. Last week
we were replacing one OSD system with a new one. During that, we had a
lot of problems with OSDs crashing on all of our systems, but that is
not our current problem.

After we got everything up and running again, we still have 3 PGs in the
incomplete state. I was checking one of them directly on the systems
(replication factor is 3). On two machines the directory was there but
empty; on the third one I found some content. Using
ceph_objectstore_tool I exported this PG and imported it on the other
nodes. Nothing changed.
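
For anyone hitting this later, the kind of invocation used was roughly the
following, with the OSD stopped first (OSD ids, the PG id and paths are
placeholders; the tool is named ceph-objectstore-tool in later releases):

  # on the node that still had data for the PG
  ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-42 \
      --journal-path /var/lib/ceph/osd/ceph-42/journal \
      --op export --pgid 2.1a5 --file /tmp/2.1a5.export
  # on the nodes whose copies were empty
  ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-17 \
      --journal-path /var/lib/ceph/osd/ceph-17/journal \
      --op import --file /tmp/2.1a5.export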

We only use Ceph for providing rbd images. Right now, two of them are
unusable, because Ceph hangs when someone tries to access content in
these PGs. Worse still, if I create a new rbd image, Ceph still uses
the incomplete PGs, so it is a pure gamble whether a new volume will
be usable or not. That, for now, makes our 900TB Ceph cluster unusable
because of 3 bad PGs.

And right here it seems like I can't do anything. Instructing the ceph
cluster to scrub, deep-scrub or repair the PG does nothing, even after
several days. Checking which rbd images are affected is also not
possible, because rados -p poolname ls hangs forever when it comes to
one of the incomplete PGs. ceph osd lost also does nothing.
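
For reference, these are the commands in question, none of which made a
difference here (pool name, PG and OSD ids are placeholders):

  ceph health detail              # lists the incomplete PGs
  ceph pg 2.1a5 query             # peering / recovery state of a single PG
  ceph pg scrub 2.1a5
  ceph pg deep-scrub 2.1a5
  ceph pg repair 2.1a5
  ceph osd lost 42 --yes-i-really-mean-it
  rados -p poolname ls            # hangs once it reaches an incomplete PG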

So right now, I am OK with losing the content of these three PGs. How
can I get the cluster back to life without deleting the whole pool, which
is not up for discussion?

Regards,
Christian

P.S.
We are using Giant
Chad William Seys
2014-12-29 22:09:50 UTC
Hi Christian,
I had a similar problem about a month ago.
After trying lots of helpful suggestions, I found that none of them worked,
and I could only delete the affected pools and start over.

I opened a feature request in the tracker:
http://tracker.ceph.com/issues/10098

If you find a way, let us know!

Chad.
Andrey Korolyov
2014-12-29 22:22:04 UTC
On Mon, Dec 29, 2014 at 12:56 PM, Christian Eichelmann
Post by Christian Eichelmann
Hi all,
we have a ceph cluster, with currently 360 OSDs in 11 Systems. Last week
we were replacing one OSD System with a new one. During that, we had a
lot of problems with OSDs crashing on all of our systems. But that is
not our current problem.
After we got everything up and running again, we still have 3 PGs in the
state incomplete. I was checking one of them directly on the systems
(replication factor is 3). On two machines the directory was there but
empty, on the third one, I found some content. Using
ceph_objectstore_tool I exported this PG and imported it on the other
nodes. Nothing changed.
We only use ceph for providing rbd images. Right now, two of them are
unusable, because ceph hangs when someone trys to access content in
these pgs. Not bad enough, if I create a new rbd image, ceph is still
using the incomplete pgs, so it is a pure gambling if a new volume will
be usable or not. That, for now, makes our 900TB ceph cluster unusable
because of 3 bad PGs.
And right here it seems like I can't to anything. Instructing the ceph
cluster to scrub, deep-scrub or repair the pg does nothing, even after
several days. Checking which rbd images are affected is also not
possible, because rados -p poolname ls hangs forever when it comes to
one of the incomplete pgs. ceph osd lost also does actually nothing.
So right now, I am OK if I lose the content of these three PGs. So how
can I get the cluster back to live without deleting the whole pool which
is not for discussion?
Christian, would you mind providing an exact backtrace for those
crashes from a core file? This clearly represents one of my worst
nightmares, a domino crash of a healthy cluster, and even for an unstable
version such as Giant the issue should at least be properly pinned down. I
also suspect that you have an almost empty cluster or a very low number of
volumes, as only two volumes are affected in your case. If you don't
care about your data, after obtaining the core dump you may want to try
marking those PGs as lost, as the operations guide suggests.
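
For completeness, a minimal sketch of that suggestion (OSD and PG ids are
placeholders; both commands are destructive, so only after taking the core
dump and an export):

  ceph osd lost 42 --yes-i-really-mean-it
  # only helps for unfound objects, not necessarily for an incomplete PG:
  ceph pg 2.1a5 mark_unfound_lost revert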
Alexandre Oliva
2014-12-30 00:49:02 UTC
Post by Christian Eichelmann
After we got everything up and running again, we still have 3 PGs in the
state incomplete. I was checking one of them directly on the systems
(replication factor is 3).
I have run into this myself at least twice before. I had not lost or
replaced the OSDs altogether, though; I had just rolled too many of them
back to earlier snapshots, which required them to be backfilled to
catch up. It looks like a PG won't get out of the incomplete state, even
to backfill other replicas, if that would keep the number of active copies
under the min size for the pool.

In my case, I brought the current-ish snapshot of the OSD back up to
enable backfilling of enough replicas, so that I could then roll the
remaining OSDs back again and have them backfilled too.

However, I suspect that temporarily setting min_size to a lower number
could be enough for the PGs to recover. If "ceph osd pool set <pool>
min_size 1" doesn't get the PGs going, I suppose restarting at least one
of the OSDs involved in the recovery, so that the PG undergoes peering
again, would get you going again.

Once backfilling completes for all formerly-incomplete PGs, or maybe
even as soon as backfilling begins, bringing the pool min_size back up
to (presumably) 2 is advisable. You don't want to be running too long
with a too-low min size :-)
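
A minimal sketch of that sequence, assuming a pool named rbd and osd.42 as
one of the OSDs involved:

  ceph osd pool get rbd min_size      # note the current value
  ceph osd pool set rbd min_size 1
  service ceph restart osd.42         # force the affected PGs to re-peer
  ceph -w                             # watch them go active / backfilling
  ceph osd pool set rbd min_size 2    # restore once backfill has finished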

I hope this helps,

Happy GNU Year,
--
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist|Red Hat Brasil GNU Toolchain Engineer
Craig Lewis
2015-01-08 01:07:46 UTC
Post by Alexandre Oliva
However, I suspect that temporarily setting min size to a lower number
could be enough for the PGs to recover. If "ceph osd pool set <pool>
min_size 1" doesn't get the PGs going, I suppose restarting at least one
of the OSDs involved in the recovery, so that the PG undergoes peering
again, would get you going again.
It depends on how incomplete your incomplete PGs are.

min_size is defined as "Sets the minimum number of replicas required for
I/O.". By default, size is 3 and min_size is 2 on recent versions of ceph.

If the number of replicas you have drops below min_size, then Ceph will
mark the PG as incomplete. As long as you have one copy of the PG, you can
recover by lowering the min_size to the number of copies you do have, then
restoring the original value after recovery is complete. I did this last
week when I deleted the wrong PGs as part of a toofull experiment.

If the number of replicas drops to 0, I think you can use ceph pg
force_create_pg, but I haven't tested it.
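
For reference, that would be something like the following (the PG id is a
placeholder; as reported elsewhere in this thread, it did not stick in
Christian's case):

  ceph pg force_create_pg 2.1a5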
Christian Balzer
2015-01-08 05:55:10 UTC
Post by Craig Lewis
Post by Alexandre Oliva
However, I suspect that temporarily setting min size to a lower number
could be enough for the PGs to recover. If "ceph osd pool set <pool>
min_size 1" doesn't get the PGs going, I suppose restarting at least
one of the OSDs involved in the recovery, so that the PG undergoes
peering again, would get you going again.
It depends on how incomplete your incomplete PGs are.
min_size is defined as "Sets the minimum number of replicas required for
I/O.". By default, size is 3 and min_size is 2 on recent versions of ceph.
If the number of replicas you have drops below min_size, then Ceph will
mark the PG as incomplete. As long as you have one copy of the PG, you
can recover by lowering the min_size to the number of copies you do
have, then restoring the original value after recovery is complete. I
did this last week when I deleted the wrong PGs as part of a toofull
experiment.
Which of course begs the question of why not have min_size at 1
permanently, so that in the (hopefully rare) case of losing 2 OSDs at the
same time your cluster still keeps working (as it should with a size of 3).

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Robert LeBlanc
2015-01-08 18:41:37 UTC
Post by Christian Balzer
Which of course begs the question of why not having min_size at 1
permanently, so that in the (hopefully rare) case of loosing 2 OSDs at the
same time your cluster still keeps working (as it should with a size of 3).
The idea is that when a write happens, at least min_size copies have it
committed on disk before the write is acknowledged back to the client,
just in case something happens to the disk before it can be
replicated. Running with min_size 1 also goes against the strongly
consistent model of Ceph.

I believe there is work to resolve the issue when the number of
replicas drops below min_size. Ceph should automatically start
backfilling to get to at least min_size so that I/O can continue. I
believe this work is also tied to prioritizing backfilling, so that
things like this are backfilled first, then backfilling continues from
min_size back up to size.

I am interested in a not-so-strict eventual consistency option in Ceph,
so that under normal circumstances, instead of needing [size] writes to
OSDs to complete, only [min_size] is needed and the primary OSD then
ensures that the laggy OSD(s) eventually get the write committed.
Christian Balzer
2015-01-09 03:31:56 UTC
Post by Robert LeBlanc
Post by Christian Balzer
Which of course begs the question of why not having min_size at 1
permanently, so that in the (hopefully rare) case of loosing 2 OSDs at
the same time your cluster still keeps working (as it should with a
size of 3).
The idea is that when a write happens at least min_size has it
committed on disk before the write is committed back to the client.
Just in case something happens to the disk before it can be
replicated. It also goes against the strongly consistent model of
Ceph.
Which of course currently means a strongly consistent lockup in these
scenarios. ^o^

Slightly off-topic and snarky: that strong consistency is of course of
limited use when, in the case of a corrupted PG, Ceph basically asks you to
toss a coin.
As in minor corruption, where it is impossible for a mere human to tell
which replica is the good one, because one OSD is down and the 2 remaining
ones differ by one bit or so.
Post by Robert LeBlanc
I believe there is work to resolve the issue when the number of
replicas drops below min_number. Ceph should automatically start
backfilling to get to at least min_num so that I/O can continue. I
believe this work is also tied to prioritizing backfilling so that
things like this are backfilled first, then backfilling min_num to get
back to size.
Yeah, I suppose that is what Greg referred to.
Hopefully soon and backported if possible.
Post by Robert LeBlanc
I am interested in a not-so-strict eventual consistency option in Ceph
so that under normal circumstances instead of needing [size] writes to
OSDs to complete, only [min_num] is needed and the primary OSD then
ensures that the laggy OSD(s) eventually gets the write committed.
This is exactly where I was coming from/getting at.

And basically what artificially setting min size to 1 in a replica 3
cluster should get you, unless I'm missing something.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Robert LeBlanc
2015-01-09 04:17:12 UTC
Post by Christian Balzer
Which of course currently means a strongly consistent lockup in these
scenarios. ^o^
That is one way of putting it
Post by Christian Balzer
Slightly off-topic and snarky, that strong consistency is of course of
limited use when in the case of a corrupted PG Ceph basically asks you to
toss a coin.
As in minor corruption, impossible for a mere human to tell which
replica is the good one, because one OSD is down and the 2 remaining ones
differ by one bit or so.
This is where checksumming is supposed to come in. I think Sage has been
leading that initiative. Basically, when an OSD reads an object it should
be able to tell if there was bit rot by hashing what it just read and
checking it against the MD5 checksum it computed when it first received the
object. If it doesn't match, it can ask another OSD until it finds one that matches.

This provides a number of benefits:

1. Protect against bit rot. Checked on read and on deep scrub.
2. Automatically recover the correct version of the object.
3. If the client computes the MD5 checksum before the data is sent over the
wire, the data can be verified end to end through the memory of several
machines/devices/cables/etc.
4. Getting by with "size" 2 is less risky for those who really want to
do that.

With all these benefits, there is a trade-off associated with it, mostly
CPU. However, with the inclusion of AES in silicon, it may not be a huge
issue now. But I'm not a programmer familiar with that aspect of the
Ceph code, so I can't be authoritative in any way.
Christian Balzer
2015-01-09 04:59:43 UTC
Post by Robert LeBlanc
Post by Christian Balzer
Which of course currently means a strongly consistent lockup in these
scenarios. ^o^
That is one way of putting it
If I had the time and more importantly the talent to help with code, I'd
do so.
Failing that, pointing out the often painful truth is something I can do.
Post by Robert LeBlanc
Post by Christian Balzer
Slightly off-topic and snarky, that strong consistency is of course of
limited use when in the case of a corrupted PG Ceph basically asks you
to toss a coin.
As in minor corruption, impossible for a mere human to tell which
replica is the good one, because one OSD is down and the 2 remaining
ones differ by one bit or so.
This is where checksumming is supposed to come in. I think Sage has been
leading that initiative.
Yeah, I'm aware of that effort.
Of course in the meantime even a very simple majority vote would be most
welcome and helpful in nearly all cases (with 3 replicas available).
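
Until something like that exists, the vote can only be done by hand on a
filestore OSD, e.g. by comparing the on-disk copies of a suspect object
across the replicas (OSD id, PG id and object name prefix are placeholders):

  # run on each replica host, then compare the checksums
  find /var/lib/ceph/osd/ceph-42/current/2.1a5_head/ -name '*rb.0.1234*' \
      -exec md5sum {} \;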

One wonders if this is basically an acknowledgement that while offloading
some things like checksums to the underlying layer/FS is desirable from a
codebase/effort/complexity view, neither BTRFS nor ZFS is fully production
ready, and they won't be for some time.
Post by Robert LeBlanc
Basically, when an OSD reads an object it should
be able to tell if there was bit rot by hashing what it just read and
checking the MD5SUM that it did when it first received the object. If it
doesn't match it can ask another OSD until it finds one that matches.
1. Protect against bit rot. Checked on read and on deep scrub.
2. Automatically recover the correct version of the object.
3. If the client computes the MD5SUM before it sent over the wire, the
data can be guaranteed through the memory of several
machines/devices/cables/etc.
4. Getting by with "size" 2 is less risky for those who really want to
do that.
With all these benefits, there is a trade-off associated with it, mostly
CPU. However with the inclusion of AES in silicon, it may not be a huge
issue now. But, I'm not a programmer and familiar with the aspect of the
Ceph code to be authoritative in any way.
Yup, all very useful and pertinent points.

Christian
--
Christian Balzer Network/Systems Engineer
***@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
Andrey Korolyov
2015-01-09 11:34:02 UTC
Post by Robert LeBlanc
Protect against bit rot. Checked on read and on deep scrub.
There are still issues (at least in Firefly) with the FDCache and scrub
completion when on-disk data has been corrupted, so thorough checksumming
will not cover every possible corruption case (at least not before
adding the possibility to invalidate the FDCache on demand). Since this
topic raised the consistency question, it is worth mentioning this too.
Most of the time this issue will not hit anyone, as hardware failures
behave differently from a single-file corruption, but it is possible to
imagine such a case, especially when dealing with SSDs. I suspect that
not everyone is familiar with the mentioned problem; in short, it looks
like we *may* corrupt certain data blocks in a filestore and, due to the
FDCache, they will not be revealed even by a deep scrub, and the problem
may persist across an OSD restart. This issue is very concerning for me
after I hit misbehaviour of the recovery procedure in the middle of
December, as my issue may be related to the one described above.
Gregory Farnum
2015-01-08 18:32:31 UTC
Post by Christian Balzer
[Craig Lewis' and Alexandre Oliva's remarks on lowering min_size quoted above -- trimmed]
Which of course begs the question of why not having min_size at 1
permanently, so that in the (hopefully rare) case of loosing 2 OSDs at the
same time your cluster still keeps working (as it should with a size of 3).
You no longer have write durability if you only have one copy of a PG.

Sam is fixing things up so that recovery will work properly as long as
you have a whole copy of the PG, which should make things behave as
people expect.
-Greg
Christian Eichelmann
2014-12-30 11:17:23 UTC
Hi Nico and all others who answered,

After some more attempts to somehow get the PGs into a working state (I've
tried force_create_pg, which put them into the creating state; but
that was obviously not true, since after rebooting one of the OSDs holding
them they went back to incomplete), I decided to save what can be saved.

I've created a new pool, created a new image there, mapped the old image
from the old pool and the new image from the new pool to a machine, to
copy data on posix level.
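
In case it helps someone else, that salvage path was roughly the following
(pool and image names are placeholders):

  rbd -p oldpool map brokenimage        # e.g. /dev/rbd0
  rbd -p newpool create newimage --size 102400
  rbd -p newpool map newimage           # e.g. /dev/rbd1
  mkfs.xfs /dev/rbd1                    # this is the formatting step that hangs
  mount /dev/rbd0 /mnt/old && mount /dev/rbd1 /mnt/new
  rsync -a /mnt/old/ /mnt/new/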

Unfortunately, formatting the image from the new pool hangs after some
time. So it seems that the new pool is suffering from the same problem
as the old pool, which is totally not understandable for me.

Right now, it seems like Ceph is giving me no options to either save
some of the still intact rbd volumes, or to create a new pool alongside the
old one to at least enable our clients to send data to Ceph again.

To tell the truth, I guess that will result in the end of our Ceph
project (which has already been running for 9 months).

Regards,
Christian
Hey Christian,
[incomplete PG / RBD hanging, osd lost also not helping]
that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
directories to allow OSDs to start after the disk filled up completely.
So I am sorry for not being able to give you a good hint, but I am very
interested in seeing your problem solved, as it is a show stopper for
us, too. (*)
Cheers,
Nico
(*) We migrated from sheepdog to gluster to ceph and so far sheepdog
seems to run much smoother. The first one is however not supported
by opennebula directly, the second one not flexible enough to host
our heterogeneous infrastructure (mixed disk sizes/amounts) - so we
are using ceph at the moment.
--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
***@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
Eneko Lacunza
2014-12-30 11:23:40 UTC
Hi Christian,

Have you tried to migrate the disk from the old storage (pool) to the
new one?

I think it should show the same problem, but I think it'd be a much
easier path to recover than the posix copy.

How full is your storage?

Maybe you can customize the crushmap, so that some OSDs are left in the
bad (default) pool, and other OSDs are set aside for the new pool. I think
(I'm still learning Ceph) that this will give each pool different PGs on
different OSDs; maybe this way you can overcome the issue.
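
A rough sketch of that idea with Giant-era syntax (bucket, rule, host and
pool names are made up):

  ceph osd crush add-bucket newroot root
  ceph osd crush move newhost1 root=newroot   # existing host bucket whose OSDs should serve the new pool
  ceph osd crush rule create-simple newrule newroot host
  ceph osd pool create newpool 1024 1024
  ceph osd pool set newpool crush_ruleset <rule-id>   # rule id from 'ceph osd crush rule dump'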

Cheers
Eneko
Post by Christian Eichelmann
[Christian's reply and the earlier messages quoted in full -- trimmed]
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Christian Eichelmann
2014-12-30 11:31:37 UTC
Hi Eneko,

I was trying an rbd cp before, but that was hanging as well, and I
couldn't find out if the source image or the destination image was
causing the hang. That's why I decided to try a posix copy.

Our cluster is still nearly empty (12TB / 867TB). But as far as I
understand (if not, somebody please correct me), placement groups are
generally not shared between pools at all.

Regards,
Christian
Post by Eneko Lacunza
[Eneko's reply and the earlier messages quoted in full -- trimmed]
--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
***@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
Eneko Lacunza
2014-12-30 11:33:29 UTC
Hi Christian,

Do the new pool's PGs also show as incomplete?

Did you notice anything remarkable in the Ceph logs during the image
format on the new pool?
Post by Christian Eichelmann
[Christian's reply and the earlier messages quoted in full -- trimmed]
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Christian Eichelmann
2014-12-30 11:39:52 UTC
Hi Eneko,

nope, the new pool has all PGs active+clean, and there were no errors
during image creation. The format command just hangs, without any error.
Post by Eneko Lacunza
[Eneko's questions and the earlier messages quoted in full -- trimmed]
--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
***@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
Nico Schottelius
2014-12-30 15:36:10 UTC
Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completely stuck and eventually the
host which had mounted the rbd-mapped devices decided to kernel panic, at
which point we decided to drop the pool and go with a backup.
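
When a client hangs like that, the stuck requests are usually visible from
the cluster side, for example (the OSD id is a placeholder):

  ceph health detail                      # shows slow/blocked requests and the incomplete PGs
  ceph pg dump_stuck inactive
  ceph daemon osd.42 dump_ops_in_flight   # on the OSD host: which ops are blocked, and why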

This story and the one from Christian make me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as a piece of software is stuck/incomplete and has not yet become
"clean" enough for production (sorry for the word joke).

Cheers,

Nico
Post by Christian Eichelmann
[Christian's reply quoted in full -- trimmed]
--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
Lionel Bouton
2015-01-07 20:10:24 UTC
Post by Nico Schottelius
Good evening,
we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.
However, after some time rsync got completly stuck and eventually the
host which mounted the rbd mapped devices decided to kernel panic at
which time we decided to drop the pool and go with a backup.
Is anyone using ceph as a backend for qemu VM images in production?
Yes with Ceph 0.80.5 since September after extensive testing over
several months (including an earlier version IIRC) and some hardware
failure simulations. We plan to upgrade one storage host and one monitor
to 0.80.7 to validate this version over several months too before
migrating the others.
Post by Nico Schottelius
Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?
Only by adding back an OSD with the data needed to reach min_size for
said PG, which is expected behavior. Even with some experiments
with isolated unstable OSDs I've not yet witnessed a case where Ceph
lost multiple replicas simultaneously (we lost one OSD to disk failure
and another to a BTRFS bug, but without trying to recover the filesystem,
so we might have been able to recover this OSD).

If your setup is susceptible to situations where you can lose all
replicas you will lose data, but there's not much that can be done
about that. Ceph actually begins to generate new replicas to replace
the missing ones after "mon osd down out interval", so the actual loss
should not happen unless you lose (and can't recover) <size> OSDs on
separate hosts (with the default crush map) simultaneously. Before going
into production you should know how long Ceph will take to fully recover
from a disk or host failure by testing it with load. Your setup might not
be robust if it doesn't have the available disk space or the speed needed
to recover quickly from such a failure.
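
A small sketch of the knobs and the test meant here (values and the OSD id
are examples, not recommendations):

  # ceph.conf, [mon] section: how long before a down OSD is marked out
  # and re-replication starts
  mon osd down out interval = 600

  # measuring recovery under load: stop one OSD, mark it out, watch
  service ceph stop osd.42
  ceph osd out 42
  ceph -w     # time how long the degraded percentage takes to return to HEALTH_OK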

Lionel
Christian Eichelmann
2015-01-09 09:43:20 UTC
Hi Lionel,

we have a Ceph cluster with about 1PB in total, 12 OSD systems with 60
disks, divided into 4 racks in 2 rooms, all connected with a dedicated
10G cluster network. Of course with a replication level of 3.

We did about 9 months of intensive testing. Just like you, we had never
experienced that kind of problem before. An incomplete PG was
recovering as soon as at least one OSD holding a copy of it came back up.

We still don't know what caused this specific error, but at no point
were more than two hosts down at the same time. Our pool has a
min_size of 1. And after everything was up again, we had completely LOST
2 of 3 PG copies (the directories on the OSDs were empty) and the third
copy was obviously broken, because even manually injecting this PG into
the other OSDs didn't change anything.

My main problem here is that with even one incomplete PG your pool is
rendered unusable. And there is currently no way to make Ceph forget
about the data of this PG and recreate it as an empty one. So the only
way to make this pool usable again is to lose all the data in it, which
for me is just not acceptable.

Regards,
Christian
Post by Lionel Bouton
[Lionel's reply quoted in full -- trimmed]
--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
***@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren
Robert LeBlanc
2015-01-09 18:50:28 UTC
On Fri, Jan 9, 2015 at 3:00 AM, Nico Schottelius
Even though I do not like the fact that we lost a pg for
an unknown reason, I would prefer ceph to handle that case to recover to
the best possible situation.
Namely I wonder if we can integrate a tool that shows
which (parts of) rbd images would be affected by dropping
a pg. That would give us the chance to selectively restore
VMs in case this happens again.
I have concerns about this as well. I would like to be able to back up
RBDs and, if needed, overwrite them with an import. Currently this is
not available.
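
The building blocks that do exist today are full-image export/import and
snapshot-based diffs; restore goes to a new image name rather than
overwriting in place (names are placeholders):

  rbd export rbd/myimage /backup/myimage.img
  rbd import /backup/myimage.img rbd/myimage-restored
  # incremental, based on snapshots:
  rbd snap create rbd/myimage@b1
  rbd export-diff rbd/myimage@b1 /backup/myimage-b1.diff
  rbd import-diff /backup/myimage-b1.diff rbd/myimage-restored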
Gregory Farnum
2015-01-09 19:21:46 UTC
On Fri, Jan 9, 2015 at 2:00 AM, Nico Schottelius
Lionel, Christian,
we do have the exactly same trouble as Christian,
namely
We still don't know what caused this specific error...
and
...there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way
to make this pool usable again is to loose all your data in there.
I wonder what is the position of ceph developers regarding
dropping (emptying) specific pgs?
Is that a use case that was never thought of or tested?
I've never worked directly on any of the clusters this has happened to,
but I believe every time we've seen issues like this with somebody we
have a relationship with, it has either:
1) been resolved by using the existing tools to mark stuff lost, or
2) been the result of local filesystems/disks silently losing data due
to some fault or other.

The second case means the OSDs have corrupted state and trusting them
is tricky. Also, most people we've had relationships with that this
has happened to really want to not lose all the data in the PG, which
necessitates manually mucking around anyway. ;)

Mailing list issues are obviously a lot harder to categorize, but the
ones we've taken time on where people say the commands don't work have
generally fallen into the second bucket.

If you want to experiment, I think all the manual mucking around has
been done with the objectstore tool and removing bad PGs, moving them
around, or faking journal entries, but I've not done it myself so I
could be mistaken.
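
For the record, the objectstore-tool operation being referred to looks
roughly like this (paths and PG id are placeholders; --op remove is
destructive, so export a copy first as shown earlier in the thread, and
stop the OSD before running it):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
      --journal-path /var/lib/ceph/osd/ceph-42/journal \
      --op remove --pgid 2.1a5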
-Greg
flisky
2015-05-16 18:13:55 UTC
Post by Gregory Farnum
[full reply quoted above -- trimmed]
Hi Gregory,
Post by Gregory Farnum
If you want to experiment, I think all the manual mucking around has
been done with the objectstore tool and removing bad PGs, moving them
around, or faking journal entries, but I've not done it myself so I
could be mistaken.
-Greg
I am facing the same problem (an incomplete PG, and force_create_pg doesn't
help), and have searched the whole internet to get to this point.

I'm trying to mock something up ...

If the pgid is 12.bb1,

* service ceph stop osd.xx

* ceph-objectstore-tool --op export --pgid 12.bb1 --data-path
/var/lib/ceph/osd/ceph-xx/ --journal-path
/var/lib/ceph/osd/ceph-xx/journal --file 12.bb1.export

the directory structure 12.bb1_head/__head_00000BB1__c is zero size (it was
made by hand), but the exported file '12.bb1.export' contains some data,
maybe recovered from the OSD journal.

* ceph-objectstore-tool --op import --data-path
/var/lib/ceph/osd/ceph-xx/ --journal-path
/var/lib/ceph/osd/ceph-xx/journal --file 12.bb1.export

succeed

* service ceph start osd.xx -- failed

traceback:

-8> 2015-05-17 01:49:16.658789 7f528608a880 10 osd.6 17288
build_past_intervals_parallel epoch 11356
-7> 2015-05-17 01:49:16.658902 7f528608a880 10 osd.6 0 add_map_bl
11356 65577 bytes
-6> 2015-05-17 01:49:16.659697 7f528608a880 10 osd.6 17288
build_past_intervals_parallel epoch 11356 pg 12.bb1
generate_past_intervals interval(11355-11355 up [11,13,5](6) acting
[5](6)): not rw, up_thru 11331 up_from 3916 last_epoch_clean 11276
generate_past_intervals interval(11355-11355 up [11,13,5](6) acting
[5](6)) : primary up 3916-11331 does not include interval

-5> 2015-05-17 01:49:16.659705 7f528608a880 10 osd.6 17288
build_past_intervals_parallel epoch 11357
-4> 2015-05-17 01:49:16.659852 7f528608a880 10 osd.6 0 add_map_bl
11357 70389 bytes
-3> 2015-05-17 01:49:16.660622 7f528608a880 10 osd.6 17288
build_past_intervals_parallel epoch 11357 pg 12.bb1
generate_past_intervals interval(11356-11356 up [11,13,5](6) acting
[5](6)): not rw, up_thru 11331 up_from 3916 last_epoch_clean 11276
generate_past_intervals interval(11356-11356 up [11,13,5](6) acting
[5](6)) : primary up 3916-11331 does not include interval

-2> 2015-05-17 01:49:16.660630 7f528608a880 10 osd.6 17288
build_past_intervals_parallel epoch 11358
-1> 2015-05-17 01:49:16.660751 7f528608a880 10 osd.6 0 add_map_bl
11358 70389 bytes
0> 2015-05-17 01:49:16.663571 7f528608a880 -1 osd/OSDMap.h: In
function 'const epoch_t& OSDMap::get_up_from(int) const' thread
7f528608a880 time 2015-05-17 01:49:16.661507
osd/OSDMap.h: 502: FAILED assert(exists(osd))

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbc51f5]
2: /usr/bin/ceph-osd() [0x63d66c]
3: (pg_interval_t::check_new_interval(int, int, std::vector<int,
std::allocator<int> > const&, std::vector<int, std::allocator<int> >
const&, int, int, std::vector<int, std::allocator<int> > const&,
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned
int, std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap
const>, pg_t, std::map<unsigned int, pg_interval_t, std::less<unsigned
int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >*,
std::ostream*)+0x605) [0x797745]
4: (OSD::build_past_intervals_parallel()+0x987) [0x69fb37]
5: (OSD::load_pgs()+0x19cf) [0x6b767f]
6: (OSD::init()+0x729) [0x6b8b99]
7: (main()+0x27f3) [0x643b63]
8: (__libc_start_main()+0xf5) [0x7f5283433af5]
9: /usr/bin/ceph-osd() [0x65cdc9]

Could you please tell me how to bypass the check_new_interval function?

Thanks!

Dan Van Der Ster
2015-01-07 20:12:29 UTC
Hi Nico,
Yes, Ceph is production ready. Yes, people are using it in production for qemu. The last time I heard, Ceph was surveyed as the most popular backend for OpenStack Cinder in production.

When using RBD in production, it really is critically important to (a) use 3 replicas and (b) pay attention to pg distribution early on so that you don't end up with unbalanced OSDs.

Replication is especially important for RBD because you _must_not_ever_lose_an_entire_pg_. Parts of every single rbd device are stored on every single PG... So losing a PG means you lost random parts of every single block device. If this happens, the only safe course of action is to restore from backups. But the whole point of Ceph is that it enables you to configure adequate replication across failure domains, which makes this scenario very very very unlikely to occur.
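
This is also why the "which images are affected?" question can be answered
per object: every rbd block is a RADOS object named after the image's
block_name_prefix, and each object maps to exactly one PG. For example
(pool/image names and the prefix are placeholders):

  rbd info rbd/myimage                            # shows block_name_prefix, e.g. rb.0.1234.5678
  ceph osd map rbd rb.0.1234.5678.000000000000    # -> PG id and acting OSDs for that block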

I don't know why you were getting kernel panics. It's probably advisable to stick to the most recent mainline kernel when using kRBD.

Cheers, Dan

On 7 Jan 2015 20:45, Nico Schottelius <nico-eph-***@schottelius.org> wrote:
[Nico's message quoted in full -- trimmed]
Jiri Kanicky
2015-01-08 23:44:33 UTC
Hi Nico.

If you are experiencing such issues it would be good if you provided more info about your deployment: Ceph version, kernel version, OS, filesystem (btrfs/xfs).

Thx Jiri

----- Reply message -----
From: "Nico Schottelius" <nico-eph-***@schottelius.org>
To: <ceph-***@lists.ceph.com>
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

[Nico's message quoted in full -- trimmed]
Jiri Kanicky
2015-01-09 15:30:04 UTC
Hi Nico,

I would probably recommend upgrading to 0.87 (Giant). I have been running
this version for some time now and it works very well. I also upgraded
from Firefly and the upgrade was easy.

The issue you are experiencing seems quite complex and it would require
debug logs to troubleshoot.

Apologies that I could not help much.

-Jiri
Good morning Jiri,
- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x)/ubuntu (2x)
Cheers,
Nico
Post by Jiri Kanicky
[previous messages quoted in full -- trimmed]