Discussion:
[ceph-users] Degraded objects after "ceph osd in $osd"
Stefan Kooman
2018-11-25 19:53:57 UTC
Hi List,

Another interesting and unexpected thing we observed during cluster
expansion is the following. After we added extra disks to the cluster,
while "norebalance" flag was set, we put the new OSDs "IN". As soon as
we did that, a couple of hundred objects would become degraded. During
that time no OSD crashed or restarted. Every "ceph osd crush add $osd
weight host=$storage-node" would cause extra degraded objects.

I don't expect objects to become degraded when extra OSDs are added.
Misplaced, yes. Degraded, no.

Does someone have an explanation for this?

Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / ***@bit.nl
Janne Johansson
2018-11-26 08:30:26 UTC
Post by Stefan Kooman
Hi List,
Another interesting and unexpected thing we observed during cluster
expansion is the following. After we added extra disks to the cluster,
while "norebalance" flag was set, we put the new OSDs "IN". As soon as
we did that, a couple of hundred objects would become degraded. During
that time no OSD crashed or restarted. Every "ceph osd crush add $osd
weight host=$storage-node" would cause extra degraded objects.
I don't expect objects to become degraded when extra OSDs are added.
Misplaced, yes. Degraded, no.
Does someone have an explanation for this?
Yes. When you add a drive (or ten), some PGs decide they should have one or
more replicas on the new drives. A new, empty copy of such a PG is created
there, and _then_ that empty copy puts the PG into the "degraded" state:
where it previously had 3 fine active+clean replicas, it now has 2 plus one
that needs backfill to get into shape.

It is a slight mistake to report this the same way as an error, even if, to
the cluster, it looks just like an error that needs fixing. It gives new
ceph admins a sense of urgency or danger, whereas adding space to a cluster
should be perfectly normal. Also, Ceph could have chosen to create a fourth
copy of a repl=3 PG, fill the new empty copy from the one that is going
away, and only then drop it, keeping 3 working replicas the whole time;
instead it first discards one replica and then backfills into the empty
one, which leads to this kind of "error" report.
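
If you want to watch this happen, something like the commands below should
show it (just a sketch; exact PG states and listing syntax can differ a bit
between Ceph releases):

  ceph -s                  # degraded / misplaced object counters
  ceph health detail       # which PGs are undersized / backfilling
  ceph pg ls undersized    # PGs that are currently short of a replica
  ceph pg ls backfilling   # PGs currently filling the copy on the new OSD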
--
May the most significant bit of your life be positive.
Stefan Kooman
2018-11-26 08:39:13 UTC
Post by Janne Johansson
Yes. When you add a drive (or ten), some PGs decide they should have one or
more replicas on the new drives. A new, empty copy of such a PG is created
there, and _then_ that empty copy puts the PG into the "degraded" state:
where it previously had 3 fine active+clean replicas, it now has 2 plus one
that needs backfill to get into shape.
It is a slight mistake to report this the same way as an error, even if, to
the cluster, it looks just like an error that needs fixing. It gives new
ceph admins a sense of urgency or danger, whereas adding space to a cluster
should be perfectly normal. Also, Ceph could have chosen to create a fourth
copy of a repl=3 PG, fill the new empty copy from the one that is going
away, and only then drop it, keeping 3 working replicas the whole time;
instead it first discards one replica and then backfills into the empty
one, which leads to this kind of "error" report.
Thanks for the explanation. I agree with you that it would be safer to
first backfill to the new copy instead of just assuming the new OSD will
be fine and discarding a perfectly healthy copy. We do have max_size 3 in
the CRUSH ruleset ... I wonder if Ceph would behave differently if we
had max_size 4 ... to actually allow a fourth copy in the first
place ...
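
For reference, the relevant settings can be checked like this (a sketch;
the rule and pool names are placeholders, and the min_size/max_size fields
on CRUSH rules are what the releases we run expose):

  ceph osd crush rule dump <rule-name>   # shows the rule's min_size / max_size
  ceph osd pool get <pool> size          # replica count actually used by the pool
  ceph osd pool get <pool> min_size      # replicas required to keep serving I/O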

Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / ***@bit.nl
Janne Johansson
2018-11-26 08:52:11 UTC
Post by Stefan Kooman
Post by Janne Johansson
It is a slight mistake to report this the same way as an error, even if, to
the cluster, it looks just like an error that needs fixing. It gives new
ceph admins a sense of urgency or danger, whereas adding space to a cluster
should be perfectly normal. Also, Ceph could have chosen to create a fourth
copy of a repl=3 PG, fill the new empty copy from the one that is going
away, and only then drop it, keeping 3 working replicas the whole time;
instead it first discards one replica and then backfills into the empty
one, which leads to this kind of "error" report.
Thanks for the explanation. I agree with you that it would be safer to
first backfill to the new copy instead of just assuming the new OSD will
be fine and discarding a perfectly healthy copy. We do have max_size 3 in
the CRUSH ruleset ... I wonder if Ceph would behave differently if we
had max_size 4 ... to actually allow a fourth copy in the first
place ...
I don't think the replication number is important here. It's more of a
design choice, which PERHAPS is meant to let you move PGs onto a new drive
while the cluster is nearly full: killing off one unneeded replica and
writing straight to the new drive frees space much faster, whereas keeping
all the old replicas until the data is 100% ok on the new copy means the
new space only appears after a large amount of data has moved, which for
large drives and large PGs can take a very long time.
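
As a rough, purely illustrative calculation (assuming a 1862 GB OSD and a
sustained backfill rate of about 100 MB/s, both numbers picked for the
example):

  1862 GB * 1024 MB/GB / 100 MB/s ≈ 19000 s ≈ 5.3 hours

so if Ceph kept all old replicas until the new copy was complete, the freed
space on a near-full cluster would only become usable hours later.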
--
May the most significant bit of your life be positive.
Marco Gaiarin
2018-11-26 11:40:06 UTC
Hi, Janne Johansson!
In that message you wrote...
It is a slight mistake to report this the same way as an error, even if, to the
cluster, it looks just like an error that needs fixing.
I think I've hit a similar situation, and I also feel that something has
to be 'fixed'. I'm looking for an explanation...

I'm adding a node (blackpanther, 4 OSDs, done) and removing a
node (vedovanera[1], 4 OSDs, to be done).

I've added the new node and slowly added 4 new OSDs, but in the meantime
another OSD (not one of the new ones, and not on the node to be removed)
died. My situation now is:

***@blackpanther:~# ceph osd df tree
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
-1 21.41985        -  5586G  2511G  3074G     0    0 root default
-2  5.45996        -  5586G  2371G  3214G 42.45 0.93     host capitanamerica
 0  1.81999  1.00000  1862G   739G  1122G 39.70 0.87         osd.0
 1  1.81999  1.00000  1862G   856G  1005G 46.00 1.00         osd.1
10  0.90999  1.00000   931G   381G   549G 40.95 0.89         osd.10
11  0.90999  1.00000   931G   394G   536G 42.35 0.92         osd.11
-3  5.03996        -  5586G  2615G  2970G 46.82 1.02     host vedovanera
 2  1.39999  1.00000  1862G   684G  1177G 36.78 0.80         osd.2
 3  1.81999  1.00000  1862G  1081G   780G 58.08 1.27         osd.3
 4  0.90999  1.00000   931G   412G   518G 44.34 0.97         osd.4
 5  0.90999  1.00000   931G   436G   494G 46.86 1.02         osd.5
-4  5.45996        -   931G   583G   347G     0    0     host deadpool
 6  1.81999  1.00000  1862G   898G   963G 48.26 1.05         osd.6
 7  1.81999  1.00000  1862G   839G  1022G 45.07 0.98         osd.7
 8  0.90999        0      0      0      0     0    0         osd.8
 9  0.90999  1.00000   931G   583G   347G 62.64 1.37         osd.9
-5  5.45996        -  5586G  2511G  3074G 44.96 0.98     host blackpanther
12  1.81999  1.00000  1862G   828G  1033G 44.51 0.97         osd.12
13  1.81999  1.00000  1862G   753G  1108G 40.47 0.88         osd.13
14  0.90999  1.00000   931G   382G   548G 41.11 0.90         osd.14
15  0.90999  1.00000   931G   546G   384G 58.66 1.28         osd.15
              TOTAL   21413G  9819G 11594G 45.85
MIN/MAX VAR: 0/1.37  STDDEV: 7.37

Perfectly healthy. But I've tried to slowly remove an OSD from
'vedovanera', and so I've tried with:

ceph osd crush reweight osd.2 <weight>

As you can see, I've arrived at weight 1.4 (from 1.81999), but if I go
lower than that I get:

    cluster 8794c124-c2ec-4e81-8631-742992159bd6
     health HEALTH_WARN
            6 pgs backfill
            1 pgs backfilling
            7 pgs stuck unclean
            recovery 2/2556513 objects degraded (0.000%)
            recovery 7721/2556513 objects misplaced (0.302%)
     monmap e6: 6 mons at {0=10.27.251.7:6789/0,1=10.27.251.8:6789/0,2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0}
            election epoch 2780, quorum 0,1,2,3,4,5 blackpanther,0,1,4,2,3
     osdmap e9302: 16 osds: 15 up, 15 in; 7 remapped pgs
      pgmap v54971897: 768 pgs, 3 pools, 3300 GB data, 830 kobjects
            9911 GB used, 11502 GB / 21413 GB avail
            2/2556513 objects degraded (0.000%)
            7721/2556513 objects misplaced (0.302%)
                 761 active+clean
                   6 active+remapped+wait_backfill
                   1 active+remapped+backfilling
  client io 9725 kB/s rd, 772 kB/s wr, 153 op/s

That is, 2 objects 'degraded'. This really puzzled me.

Why?! Thanks.
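
For what it's worth, the gradual drain I'm attempting is simply this
(a sketch; the step size is arbitrary):

  # lower the CRUSH weight in small steps, waiting for backfill to finish in between
  ceph osd crush reweight osd.2 1.6
  ceph -s                              # wait until backfill finishes
  ceph osd crush reweight osd.2 1.4
  ceph -s
  # ...and so on down to 0, then remove the OSD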


[1] Some Marvel Comics heroes have their names translated into Italian, so
'vedovanera' is 'Black Widow' and 'capitanamerica' is, clearly, 'Captain America'.
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
Marco Gaiarin
2018-11-29 08:59:10 UTC
Replying to myself.
Post by Marco Gaiarin
I've added the new node and slowly added 4 new OSDs, but in the meantime
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
-1 21.41985        -  5586G  2511G  3074G     0    0 root default
-2  5.45996        -  5586G  2371G  3214G 42.45 0.93     host capitanamerica
 0  1.81999  1.00000  1862G   739G  1122G 39.70 0.87         osd.0
 1  1.81999  1.00000  1862G   856G  1005G 46.00 1.00         osd.1
10  0.90999  1.00000   931G   381G   549G 40.95 0.89         osd.10
11  0.90999  1.00000   931G   394G   536G 42.35 0.92         osd.11
-3  5.03996        -  5586G  2615G  2970G 46.82 1.02     host vedovanera
 2  1.39999  1.00000  1862G   684G  1177G 36.78 0.80         osd.2
 3  1.81999  1.00000  1862G  1081G   780G 58.08 1.27         osd.3
 4  0.90999  1.00000   931G   412G   518G 44.34 0.97         osd.4
 5  0.90999  1.00000   931G   436G   494G 46.86 1.02         osd.5
-4  5.45996        -   931G   583G   347G     0    0     host deadpool
 6  1.81999  1.00000  1862G   898G   963G 48.26 1.05         osd.6
 7  1.81999  1.00000  1862G   839G  1022G 45.07 0.98         osd.7
 8  0.90999        0      0      0      0     0    0         osd.8
 9  0.90999  1.00000   931G   583G   347G 62.64 1.37         osd.9
-5  5.45996        -  5586G  2511G  3074G 44.96 0.98     host blackpanther
12  1.81999  1.00000  1862G   828G  1033G 44.51 0.97         osd.12
13  1.81999  1.00000  1862G   753G  1108G 40.47 0.88         osd.13
14  0.90999  1.00000   931G   382G   548G 41.11 0.90         osd.14
15  0.90999  1.00000   931G   546G   384G 58.66 1.28         osd.15
              TOTAL   21413G  9819G 11594G 45.85
MIN/MAX VAR: 0/1.37  STDDEV: 7.37
Perfectly healthy. But I've tried to slowly remove an OSD from
ceph osd crush reweight osd.2 <weight>
As you can see, I've arrived at weight 1.4 (from 1.81999), but if I go
[...]
Post by Marco Gaiarin
recovery 2/2556513 objects degraded (0.000%)
It seems that the trouble came from osd.8, which was out and down but had
not been removed from the crushmap (it still had weight 0.90999).

After removing osd.8 a massive rebalance started. After that, I can now
lower the weight of the OSDs on node vedovanera without getting any more
degraded objects.
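
For completeness, removing a dead OSD like osd.8 comes down to the classic
sequence below (a sketch; on Luminous and later a single 'ceph osd purge'
can replace these steps):

  ceph osd crush remove osd.8   # drop it from the CRUSH map (this is what triggers the rebalance)
  ceph auth del osd.8           # remove its cephx key
  ceph osd rm 8                 # remove it from the OSD map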

I think I'm starting to understand how the CRUSH algorithm actually
works. ;-)
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
Gregory Farnum
2018-11-26 13:10:12 UTC
Post by Janne Johansson
Post by Stefan Kooman
Hi List,
Another interesting and unexpected thing we observed during cluster
expansion is the following. After we added extra disks to the cluster,
while "norebalance" flag was set, we put the new OSDs "IN". As soon as
we did that, a couple of hundred objects would become degraded. During
that time no OSD crashed or restarted. Every "ceph osd crush add $osd
weight host=$storage-node" would cause extra degraded objects.
I don't expect objects to become degraded when extra OSDs are added.
Misplaced, yes. Degraded, no.
Does someone have an explanation for this?
Yes. When you add a drive (or ten), some PGs decide they should have one or
more replicas on the new drives. A new, empty copy of such a PG is created
there, and _then_ that empty copy puts the PG into the "degraded" state:
where it previously had 3 fine active+clean replicas, it now has 2 plus one
that needs backfill to get into shape.
It is a slight mistake to report this the same way as an error, even if, to
the cluster, it looks just like an error that needs fixing. It gives new
ceph admins a sense of urgency or danger, whereas adding space to a cluster
should be perfectly normal. Also, Ceph could have chosen to create a fourth
copy of a repl=3 PG, fill the new empty copy from the one that is going
away, and only then drop it, keeping 3 working replicas the whole time;
instead it first discards one replica and then backfills into the empty
one, which leads to this kind of "error" report.
See, that's the thing: Ceph is designed *not* to reduce data reliability
this way; it shouldn't do that, and as far as I've been able to establish,
it doesn't actually do that. Which makes these degraded object
reports a bit perplexing.

What we have worked out is that sometimes objects can be degraded because
the log-based recovery takes a while after the primary juggles around PG
set membership, and I suspect that's what is turning up here. The exact
cause still eludes me a bit, but I assume it's a consequence of the
backfill and recovery throttling we've added over the years.
If a whole PG was missing then you'd expect to see very large degraded
object counts (as opposed to the 2 that Marco reported).
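
If anyone wants to dig into a concrete occurrence, something like the
following can help tell log-based recovery apart from backfill (a sketch;
field names and "pg ls" filters vary a bit across releases):

  ceph pg ls degraded     # PGs currently reporting degraded objects
  ceph pg ls recovering   # PGs doing log-based recovery
  ceph pg ls backfilling  # PGs doing backfill
  ceph pg <pgid> query    # 'recovery_state' shows which path a given PG took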

-Greg
Post by Janne Johansson
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com