Discussion:
All PGs are active+clean, still remapped PGs
Wido den Hollander
2016-10-24 20:19:00 UTC
Hi,

On a cluster running Hammer 0.94.9 (upgraded from Firefly) I have 29 remapped PGs according to the OSDMap, but all PGs are active+clean.

osdmap e111208: 171 osds: 166 up, 166 in; 29 remapped pgs

pgmap v101069070: 6144 pgs, 2 pools, 90122 GB data, 22787 kobjects
264 TB used, 184 TB / 448 TB avail
6144 active+clean

The OSDMap shows:

***@mon1:~# ceph osd dump|grep pg_temp
pg_temp 4.39 [160,17,10,8]
pg_temp 4.52 [161,16,10,11]
pg_temp 4.8b [166,29,10,7]
pg_temp 4.b1 [5,162,148,2]
pg_temp 4.168 [95,59,6,2]
pg_temp 4.1ef [22,162,10,5]
pg_temp 4.2c9 [164,95,10,7]
pg_temp 4.330 [165,154,10,8]
pg_temp 4.353 [2,33,18,54]
pg_temp 4.3f8 [88,67,10,7]
pg_temp 4.41a [30,59,10,5]
pg_temp 4.45f [47,156,21,2]
pg_temp 4.486 [138,43,10,7]
pg_temp 4.674 [59,18,7,2]
pg_temp 4.7b8 [164,68,10,11]
pg_temp 4.816 [167,147,57,2]
pg_temp 4.829 [82,45,10,11]
pg_temp 4.843 [141,34,10,6]
pg_temp 4.862 [31,160,138,2]
pg_temp 4.868 [78,67,10,5]
pg_temp 4.9ca [150,68,10,8]
pg_temp 4.a83 [156,83,10,7]
pg_temp 4.a98 [161,94,10,7]
pg_temp 4.b80 [162,88,10,8]
pg_temp 4.d41 [163,52,10,6]
pg_temp 4.d54 [149,140,10,7]
pg_temp 4.e8e [164,78,10,8]
pg_temp 4.f2a [150,68,10,6]
pg_temp 4.ff3 [30,157,10,7]
***@mon1:~#

So I tried to restart osd.160 and osd.161, but that didn't change the state.

***@mon1:~# ceph pg 4.39 query
{
"state": "active+clean",
"snap_trimq": "[]",
"epoch": 111212,
"up": [
160,
17,
8
],
"acting": [
160,
17,
8
],
"actingbackfill": [
"8",
"17",
"160"
],

In all these PGs osd.10 is involved, but that OSD is down and out. I tried marking it as down again, but that didn't work.

I haven't tried removing osd.10 yet from the CRUSHMap since that will trigger a rather large rebalance.

This cluster is still running with the Dumpling tunables though, so that might be the issue. But before I trigger a very large rebalance I wanted to check if there are any insights on this one.

Thanks,

Wido
David Turner
2016-10-24 20:24:51 UTC
Are you running a replica size of 4? If not, these might be erroneously reported as being on osd.10.

David Turner
2016-10-24 20:41:08 UTC
More to my curiosity on this: our clusters leave behind /var/lib/ceph/osd/ceph-##/current/pg_temp folders on occasion. If you check all of the pg_temp folders for osd.10, you might find something that's holding onto the PG even if it has really moved on.
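
For what it's worth, a quick way to check for leftovers like that on a FileStore OSD that is still mounted would be something along these lines (the path and the *_TEMP pattern are assumptions about the usual FileStore layout, so adjust to what you actually see on disk):

ls -d /var/lib/ceph/osd/ceph-10/current/*_TEMP
ls -d /var/lib/ceph/osd/ceph-10/current/4.39_*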

Wido den Hollander
2016-10-25 05:18:17 UTC
> On 24 October 2016 at 22:41, David Turner <***@storagecraft.com> wrote:
>
>
> More to my curiosity on this: our clusters leave behind /var/lib/ceph/osd/ceph-##/current/pg_temp folders on occasion. If you check all of the pg_temp folders for osd.10, you might find something that's holding onto the PG even if it has really moved on.
>

Thanks, but osd.10 is already down and out. The disk has been broken for a while now.

Wido

Dan van der Ster
2016-10-24 20:29:45 UTC
Hi Wido,

This seems similar to what our dumpling tunables cluster does when a few
particular osds go down... Though in our case the remapped pgs are
correctly shown as remapped, not clean.

The fix in our case will be to enable the vary_r tunable (which will move
some data).

Cheers, Dan

Wido den Hollander
2016-10-25 05:06:09 UTC
> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
>
>
> Hi Wido,
>
> This seems similar to what our dumpling tunables cluster does when a few
> particular osds go down... Though in our case the remapped pgs are
> correctly shown as remapped, not clean.
>
> The fix in our case will be to enable the vary_r tunable (which will move
> some data).
>

Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
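
(For reference, applying that profile is normally a single command along the lines of:

ceph osd crush tunables firefly

which rewrites the CRUSH tunables and kicks off the large rebalance mentioned below.)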

The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.

I'll report back later, but this rebalance will take a lot of time.

Wido

Dan van der Ster
2016-10-26 08:35:17 UTC
On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
>
>> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
>>
>>
>> Hi Wido,
>>
>> This seems similar to what our dumpling tunables cluster does when a few
>> particular osds go down... Though in our case the remapped pgs are
>> correctly shown as remapped, not clean.
>>
>> The fix in our case will be to enable the vary_r tunable (which will move
>> some data).
>>
>
> Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
>
> The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
>
> I'll report back later, but this rebalance will take a lot of time.

I forgot to mention, a workaround for the vary_r issue is to simply
remove the down/out osd from the crush map. We just hit this issue
again last night on a failed osd and after removing it from the crush
map the last degraded PG started backfilling.
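
For a dead OSD that is going to be removed anyway, that boils down to something like the following (osd.10 is just an example id here):

ceph osd crush remove osd.10
ceph auth del osd.10
ceph osd rm 10

The first command is the one that changes CRUSH placement and therefore triggers the data movement.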

Cheers, Dan


Sage Weil
2016-10-26 08:44:03 UTC
On Wed, 26 Oct 2016, Dan van der Ster wrote:
> On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
> >
> >> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
> >>
> >>
> >> Hi Wido,
> >>
> >> This seems similar to what our dumpling tunables cluster does when a few
> >> particular osds go down... Though in our case the remapped pgs are
> >> correctly shown as remapped, not clean.
> >>
> >> The fix in our case will be to enable the vary_r tunable (which will move
> >> some data).
> >>
> >
> > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> >
> > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
> >
> > I'll report back later, but this rebalance will take a lot of time.
>
> I forgot to mention, a workaround for the vary_r issue is to simply
> remove the down/out osd from the crush map. We just hit this issue
> again last night on a failed osd and after removing it from the crush
> map the last degraded PG started backfilling.

Also note that if you do enable vary_r, you can set it to a higher value
(like 5) to get the benefit without moving as much existing data. See the
CRUSH tunable docs for more details!
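
A rough sketch of how to do that by editing the decompiled CRUSH map (the file names are placeholders):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and set: tunable chooseleaf_vary_r 5
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new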

sage


Wido den Hollander
2016-10-26 09:18:07 UTC
> On 26 October 2016 at 10:44, Sage Weil <***@newdream.net> wrote:
>
>
> On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
> > >
> > >> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
> > >>
> > >>
> > >> Hi Wido,
> > >>
> > >> This seems similar to what our dumpling tunables cluster does when a few
> > >> particular osds go down... Though in our case the remapped pgs are
> > >> correctly shown as remapped, not clean.
> > >>
> > >> The fix in our case will be to enable the vary_r tunable (which will move
> > >> some data).
> > >>
> > >
> > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> > >
> > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
> > >
> > > I'll report back later, but this rebalance will take a lot of time.
> >
> > I forgot to mention, a workaround for the vary_r issue is to simply
> > remove the down/out osd from the crush map. We just hit this issue
> > again last night on a failed osd and after removing it from the crush
> > map the last degraded PG started backfilling.
>
> Also note that if you do enable vary_r, you can set it to a higher value
> (like 5) to get the benefit without moving as much existing data. See the
> CRUSH tunable docs for more details!
>

Yes, thanks. So with the input here we have a few options and are deciding which routes to take.

The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:

- vary_r to 1: 73% misplaced
- vary_r to 2 ~ 4: Looking into it
- Removing dead OSDs from CRUSH

As the cluster is under some stress we have to do this during the weekends, which makes it a bit difficult, but nothing we can't overcome.

Thanks again for the input and I'll report on what we did later on.

Wido

Wido den Hollander
2016-11-02 13:20:39 UTC
> On 26 October 2016 at 11:18, Wido den Hollander <***@42on.com> wrote:
>
>
>
> > On 26 October 2016 at 10:44, Sage Weil <***@newdream.net> wrote:
> >
> >
> > On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
> > > >
> > > >> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
> > > >>
> > > >>
> > > >> Hi Wido,
> > > >>
> > > >> This seems similar to what our dumpling tunables cluster does when a few
> > > >> particular osds go down... Though in our case the remapped pgs are
> > > >> correctly shown as remapped, not clean.
> > > >>
> > > >> The fix in our case will be to enable the vary_r tunable (which will move
> > > >> some data).
> > > >>
> > > >
> > > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> > > >
> > > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
> > > >
> > > > I'll report back later, but this rebalance will take a lot of time.
> > >
> > > I forgot to mention, a workaround for the vary_r issue is to simply
> > > remove the down/out osd from the crush map. We just hit this issue
> > > again last night on a failed osd and after removing it from the crush
> > > map the last degraded PG started backfilling.
> >
> > Also note that if you do enable vary_r, you can set it to a higher value
> > (like 5) to get the benefit without moving as much existing data. See the
> > CRUSH tunable docs for more details!
> >
>
> Yes, thanks. So with the input here we have a few options and are deciding which routes to take.
>
> The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:
>
> - vary_r to 1: 73% misplaced
> - vary_r to 2 ~ 4: Looking into it
> - Removing dead OSDs from CRUSH
>
> As the cluster is under some stress we have to do this in the weekends, that makes it a bit difficult, but nothing we can't overcome.
>
> Thanks again for the input and I'll report on what we did later on.
>

So, what I did:
- Remove all dead OSDs from the CRUSHMap and OSDMap
- Set vary_r to 2

This resulted in:

osdmap e119647: 169 osds: 166 up, 166 in; 6 remapped pgs

pg_temp 4.39 [160,17,10,8]
pg_temp 4.2c9 [164,95,10,7]
pg_temp 4.816 [167,147,57,2]
pg_temp 4.862 [31,160,138,2]
pg_temp 4.a83 [156,83,10,7]
pg_temp 4.e8e [164,78,10,8]

In this case, osd.2 and osd.10 no longer exist, neither in the OSDMap nor in the CRUSHMap.

***@mon1:~# ceph osd metadata 2
Error ENOENT: osd.2 does not exist
***@mon1:~# ceph osd metadata 10
Error ENOENT: osd.10 does not exist
***@mon1:~# ceph osd find 2
Error ENOENT: osd.2 does not exist
***@mon1:~# ceph osd find 10
Error ENOENT: osd.10 does not exist
***@mon1:~#

Looking at PG '4.39' for example, a query tells me:

"up": [
160,
17,
8
],
"acting": [
160,
17,
8
],

So I really wonder where the pg_temp with osd.10 comes from.

Setting vary_r to 1 will result in a 76% degraded state for the cluster and I'm trying to avoid that (for now).

I restarted the Primary OSDs for all the affected PGs, but that didn't help either.

Any bright ideas on how to fix this?

Wido

Sage Weil
2016-11-02 13:30:42 UTC
On Wed, 2 Nov 2016, Wido den Hollander wrote:
>
> > On 26 October 2016 at 11:18, Wido den Hollander <***@42on.com> wrote:
> >
> >
> >
> > > On 26 October 2016 at 10:44, Sage Weil <***@newdream.net> wrote:
> > >
> > >
> > > On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
> > > > >
> > > > >> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
> > > > >>
> > > > >>
> > > > >> Hi Wido,
> > > > >>
> > > > >> This seems similar to what our dumpling tunables cluster does when a few
> > > > >> particular osds go down... Though in our case the remapped pgs are
> > > > >> correctly shown as remapped, not clean.
> > > > >>
> > > > >> The fix in our case will be to enable the vary_r tunable (which will move
> > > > >> some data).
> > > > >>
> > > > >
> > > > > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> > > > >
> > > > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
> > > > >
> > > > > I'll report back later, but this rebalance will take a lot of time.
> > > >
> > > > I forgot to mention, a workaround for the vary_r issue is to simply
> > > > remove the down/out osd from the crush map. We just hit this issue
> > > > again last night on a failed osd and after removing it from the crush
> > > > map the last degraded PG started backfilling.
> > >
> > > Also note that if you do enable vary_r, you can set it to a higher value
> > > (like 5) to get the benefit without moving as much existing data. See the
> > > CRUSH tunable docs for more details!
> > >
> >
> > Yes, thanks. So with the input here we have a few options and are deciding which routes to take.
> >
> > The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:
> >
> > - vary_r to 1: 73% misplaced
> > - vary_r to 2 ~ 4: Looking into it
> > - Removing dead OSDs from CRUSH
> >
> > As the cluster is under some stress we have to do this in the weekends, that makes it a bit difficult, but nothing we can't overcome.
> >
> > Thanks again for the input and I'll report on what we did later on.
> >
>
> So, what I did:
> - Remove all dead OSDs from the CRUSHMap and OSDMap
> - Set vary_r to 2
>
> This resulted in:
>
> osdmap e119647: 169 osds: 166 up, 166 in; 6 remapped pgs
>
> pg_temp 4.39 [160,17,10,8]
> pg_temp 4.2c9 [164,95,10,7]
> pg_temp 4.816 [167,147,57,2]
> pg_temp 4.862 [31,160,138,2]
> pg_temp 4.a83 [156,83,10,7]
> pg_temp 4.e8e [164,78,10,8]
>
> In this case, osd 2 and 10 no longer exist, not in the OSDMap nor in the CRUSHMap.
>
> ***@mon1:~# ceph osd metadata 2
> Error ENOENT: osd.2 does not exist
> ***@mon1:~# ceph osd metadata 10
> Error ENOENT: osd.10 does not exist
> ***@mon1:~# ceph osd find 2
> Error ENOENT: osd.2 does not exist
> ***@mon1:~# ceph osd find 10
> Error ENOENT: osd.10 does not exist
> ***@mon1:~#
>
> Looking at PG '4.39' for example, a query tells me:
>
> "up": [
> 160,
> 17,
> 8
> ],
> "acting": [
> 160,
> 17,
> 8
> ],
>
> So I really wonder where the pg_temp with osd.10 comes from.

Hmm.. are the others also like that? You can manually poke it into adjusting pg-temp with

ceph osd pg-temp <pgid> <just the primary osd>

That'll make peering reevaluate what pg_temp it wants (if any). It might
be that it isn't noticing that pg_temp matches acting.. but the mon has
special code to remove those entries, so hrm. Is this hammer?

> Setting vary_r to 1 will result in a 76% degraded state for the cluster
> and I'm trying to avoid that (for now).
>
> I restarted the Primary OSDs for all the affected PGs, but that didn't
> help either.
>
> Any bright ideas on how to fix this?

This part seems unrelated to vary_r... you shouldn't have to
reduce it further!

sage


Wido den Hollander
2016-11-02 13:47:46 UTC
> On 2 November 2016 at 14:30, Sage Weil <***@newdream.net> wrote:
>
>
> On Wed, 2 Nov 2016, Wido den Hollander wrote:
> >
> > > On 26 October 2016 at 11:18, Wido den Hollander <***@42on.com> wrote:
> > >
> > >
> > >
> > > > On 26 October 2016 at 10:44, Sage Weil <***@newdream.net> wrote:
> > > >
> > > >
> > > > On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > > > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
> > > > > >
> > > > > >> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
> > > > > >>
> > > > > >>
> > > > > >> Hi Wido,
> > > > > >>
> > > > > >> This seems similar to what our dumpling tunables cluster does when a few
> > > > > >> particular osds go down... Though in our case the remapped pgs are
> > > > > >> correctly shown as remapped, not clean.
> > > > > >>
> > > > > >> The fix in our case will be to enable the vary_r tunable (which will move
> > > > > >> some data).
> > > > > >>
> > > > > >
> > > > > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> > > > > >
> > > > > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
> > > > > >
> > > > > > I'll report back later, but this rebalance will take a lot of time.
> > > > >
> > > > > I forgot to mention, a workaround for the vary_r issue is to simply
> > > > > remove the down/out osd from the crush map. We just hit this issue
> > > > > again last night on a failed osd and after removing it from the crush
> > > > > map the last degraded PG started backfilling.
> > > >
> > > > Also note that if you do enable vary_r, you can set it to a higher value
> > > > (like 5) to get the benefit without moving as much existing data. See the
> > > > CRUSH tunable docs for more details!
> > > >
> > >
> > > Yes, thanks. So with the input here we have a few options and are deciding which routes to take.
> > >
> > > The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:
> > >
> > > - vary_r to 1: 73% misplaced
> > > - vary_r to 2 ~ 4: Looking into it
> > > - Removing dead OSDs from CRUSH
> > >
> > > As the cluster is under some stress we have to do this in the weekends, that makes it a bit difficult, but nothing we can't overcome.
> > >
> > > Thanks again for the input and I'll report on what we did later on.
> > >
> >
> > So, what I did:
> > - Remove all dead OSDs from the CRUSHMap and OSDMap
> > - Set vary_r to 2
> >
> > This resulted in:
> >
> > osdmap e119647: 169 osds: 166 up, 166 in; 6 remapped pgs
> >
> > pg_temp 4.39 [160,17,10,8]
> > pg_temp 4.2c9 [164,95,10,7]
> > pg_temp 4.816 [167,147,57,2]
> > pg_temp 4.862 [31,160,138,2]
> > pg_temp 4.a83 [156,83,10,7]
> > pg_temp 4.e8e [164,78,10,8]
> >
> > In this case, osd 2 and 10 no longer exist, not in the OSDMap nor in the CRUSHMap.
> >
> > ***@mon1:~# ceph osd metadata 2
> > Error ENOENT: osd.2 does not exist
> > ***@mon1:~# ceph osd metadata 10
> > Error ENOENT: osd.10 does not exist
> > ***@mon1:~# ceph osd find 2
> > Error ENOENT: osd.2 does not exist
> > ***@mon1:~# ceph osd find 10
> > Error ENOENT: osd.10 does not exist
> > ***@mon1:~#
> >
> > Looking at PG '4.39' for example, a query tells me:
> >
> > "up": [
> > 160,
> > 17,
> > 8
> > ],
> > "acting": [
> > 160,
> > 17,
> > 8
> > ],
> >
> > So I really wonder where the pg_temp with osd.10 comes from.
>
> Hmm.. are the others also the same like that? You can manually poke
> it into adjusting pg-temp with
>
> ceph osd pg-temp <pgid> <just the primary osd>
>
> That'll make peering reevaluate what pg_temp it wants (if any). It might
> be that it isn't noticing that pg_temp matches acting.. but the mon has
> special code to remove those entries, so hrm. Is this hammer?
>

So yes, that worked. I did it for 3 PGs:

# ceph osd pg-temp 4.39 160
# ceph osd pg-temp 4.2c9 164
# ceph osd pg-temp 4.816 167

Now my pg_temp looks like:

pg_temp 4.862 [31,160,138,2]
pg_temp 4.a83 [156,83,10,7]
pg_temp 4.e8e [164,78,10,8]

There we see osd.2 and osd.10 again. I'm not setting these yet, in case you want logs from the MONs or OSDs.

This is Hammer 0.94.9

> > Setting vary_r to 1 will result in a 76% degraded state for the cluster
> > and I'm trying to avoid that (for now).
> >
> > I restarted the Primary OSDs for all the affected PGs, but that didn't
> > help either.
> >
> > Any bright ideas on how to fix this?
>
> This part seems unrelated to vary_r... you shouldn't have to
> reduce it further!
>

Indeed, like you said, the pg-temp command fixed it for 3 PGs already. I'm holding off on the rest in case you want logs or want to debug it further.

Wido

Sage Weil
2016-11-02 14:06:33 UTC
On Wed, 2 Nov 2016, Wido den Hollander wrote:
>
> > On 2 November 2016 at 14:30, Sage Weil <***@newdream.net> wrote:
> >
> >
> > On Wed, 2 Nov 2016, Wido den Hollander wrote:
> > >
> > > > On 26 October 2016 at 11:18, Wido den Hollander <***@42on.com> wrote:
> > > >
> > > >
> > > >
> > > > > On 26 October 2016 at 10:44, Sage Weil <***@newdream.net> wrote:
> > > > >
> > > > >
> > > > > On Wed, 26 Oct 2016, Dan van der Ster wrote:
> > > > > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander <***@42on.com> wrote:
> > > > > > >
> > > > > > >> On 24 October 2016 at 22:29, Dan van der Ster <***@vanderster.com> wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> Hi Wido,
> > > > > > >>
> > > > > > >> This seems similar to what our dumpling tunables cluster does when a few
> > > > > > >> particular osds go down... Though in our case the remapped pgs are
> > > > > > >> correctly shown as remapped, not clean.
> > > > > > >>
> > > > > > >> The fix in our case will be to enable the vary_r tunable (which will move
> > > > > > >> some data).
> > > > > > >>
> > > > > > >
> > > > > > > Ah, as I figured. I will probably apply the Firefly tunables here. This cluster was upgraded from Dumpling to Firefly and to Hammer recently and we didn't change the tunables yet.
> > > > > > >
> > > > > > > The MON stores are 35GB each right now and I think they are not trimming due to the pg_temp which still exists.
> > > > > > >
> > > > > > > I'll report back later, but this rebalance will take a lot of time.
> > > > > >
> > > > > > I forgot to mention, a workaround for the vary_r issue is to simply
> > > > > > remove the down/out osd from the crush map. We just hit this issue
> > > > > > again last night on a failed osd and after removing it from the crush
> > > > > > map the last degraded PG started backfilling.
> > > > >
> > > > > Also note that if you do enable vary_r, you can set it to a higher value
> > > > > (like 5) to get the benefit without moving as much existing data. See the
> > > > > CRUSH tunable docs for more details!
> > > > >
> > > >
> > > > Yes, thanks. So with the input here we have a few options and are deciding which routes to take.
> > > >
> > > > The cluster is rather old (hw as well), so we have to be careful at this time. For the record, our options are:
> > > >
> > > > - vary_r to 1: 73% misplaced
> > > > - vary_r to 2 ~ 4: Looking into it
> > > > - Removing dead OSDs from CRUSH
> > > >
> > > > As the cluster is under some stress we have to do this on weekends, which makes it a bit difficult, but nothing we can't overcome.
> > > >
> > > > Thanks again for the input and I'll report on what we did later on.
> > > >
> > >
> > > So, what I did:
> > > - Remove all dead OSDs from the CRUSHMap and OSDMap
> > > - Set vary_r to 2
> > >
> > > This resulted in:
> > >
> > > osdmap e119647: 169 osds: 166 up, 166 in; 6 remapped pgs
> > >
> > > pg_temp 4.39 [160,17,10,8]
> > > pg_temp 4.2c9 [164,95,10,7]
> > > pg_temp 4.816 [167,147,57,2]
> > > pg_temp 4.862 [31,160,138,2]
> > > pg_temp 4.a83 [156,83,10,7]
> > > pg_temp 4.e8e [164,78,10,8]
> > >
> > > In this case, osd 2 and 10 no longer exist, not in the OSDMap nor in the CRUSHMap.
> > >
> > > ***@mon1:~# ceph osd metadata 2
> > > Error ENOENT: osd.2 does not exist
> > > ***@mon1:~# ceph osd metadata 10
> > > Error ENOENT: osd.10 does not exist
> > > ***@mon1:~# ceph osd find 2
> > > Error ENOENT: osd.2 does not exist
> > > ***@mon1:~# ceph osd find 10
> > > Error ENOENT: osd.10 does not exist
> > > ***@mon1:~#
> > >
> > > Looking at PG '4.39' for example, a query tells me:
> > >
> > > "up": [
> > > 160,
> > > 17,
> > > 8
> > > ],
> > > "acting": [
> > > 160,
> > > 17,
> > > 8
> > > ],
> > >
> > > So I really wonder where the pg_temp with osd.10 comes from.
> >
> > Hmm.. are the others also the same as that? You can manually poke
> > it into adjusting pg-temp with
> >
> > ceph osd pg-temp <pgid> <just the primary osd>
> >
> > That'll make peering reevaluate what pg_temp it wants (if any). It might
> > be that it isn't noticing that pg_temp matches acting.. but the mon has
> > special code to remove those entries, so hrm. Is this hammer?
> >
>
> So yes, that worked. I did it for 3 PGs:
>
> # ceph osd pg-temp 4.39 160
> # ceph osd pg-temp 4.2c9 164
> # ceph osd pg-temp 4.816 167
>
> Now my pg_temp looks like:
>
> pg_temp 4.862 [31,160,138,2]
> pg_temp 4.a83 [156,83,10,7]
> pg_temp 4.e8e [164,78,10,8]
>
> There we see the osd.2 and osd.10 again. I'm not setting these yet since you might want logs from the MONs or OSDs?
>
> This is Hammer 0.94.9

I'm pretty sure this is a race condition that got cleaned up as part of
https://github.com/ceph/ceph/pull/9078/commits. The mon only checks the
pg_temp entries that are getting set/changed, and since those are already
in place it doesn't recheck them. Any poke to the cluster that triggers
peering ought to be enough to clear it up. So, no need for logs, thanks!

We could add a special check during, say, upgrade, but generally the PGs
will re-peer as the OSDs restart anyway and that will clear it up.

Maybe you can just confirm that marking an osd down (say, ceph osd down
31) is also enough to remove the stray entry?

Thanks!
sage

>
> > > Setting vary_r to 1 will result in a 76% degraded state for the cluster
> > > and I'm trying to avoid that (for now).
> > >
> > > I restarted the Primary OSDs for all the affected PGs, but that didn't
> > > help either.
> > >
> > > Any bright ideas on how to fix this?
> >
> > This part seems unrelated to vary_r... you shouldn't have to
> > reduce it further!
> >
>
> Indeed, like you said, the pg_temp fixed it for 3 PGs already. Holding off with the rest in case you want logs or debug it further.
>
> Wido
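(A sketch for readers in the same situation, not from the original thread: the pg-temp poke suggested above can be applied to every remaining stray entry in one pass. It assumes the plain-text output of 'ceph osd dump' and the JSON output of 'ceph pg <pgid> query' shown in this thread, plus a python interpreter on the admin node; try it on a single PG first.)

ceph osd dump | awk '/^pg_temp/ {print $2}' | while read pgid; do
    # Re-set pg_temp to just the current primary; the monitor then re-evaluates
    # the entry and drops it, which is exactly the manual poke used below.
    primary=$(ceph pg "$pgid" query | python -c 'import json,sys; print(json.load(sys.stdin)["acting"][0])')
    ceph osd pg-temp "$pgid" "$primary"
done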
Wido den Hollander
2016-11-02 14:28:05 UTC
Permalink
> On 2 November 2016 at 15:06, Sage Weil <***@newdream.net> wrote:
>
>
> I'm pretty sure this is a race condition that got cleaned up as part of
> https://github.com/ceph/ceph/pull/9078/commits. The mon only checks the
> pg_temp entries that are getting set/changed, and since those are already
> in place it doesn't recheck them. Any poke to the cluster that triggers
> peering ought to be enough to clear it up. So, no need for logs, thanks!
>

Ok, just checking.

> We could add a special check during, say, upgrade, but generally the PGs
> will re-peer as the OSDs restart anyway and that will clear it up.
>
> Maybe you can just confirm that marking an osd down (say, ceph osd down
> 31) is also enough to remove the stray entry?
>

I already tried a restart of the OSDs, but that didn't work. I marked osd 31, 160 and 138 as down for PG 4.862 but that didn't work:

pg_temp 4.862 [31,160,138,2]

But this works:

***@mon1:~# ceph osd dump|grep pg_temp
pg_temp 4.862 [31,160,138,2]
pg_temp 4.a83 [156,83,10,7]
pg_temp 4.e8e [164,78,10,8]
***@mon1:~# ceph osd pg-temp 4.862 31
set 4.862 pg_temp mapping to [31]
***@mon1:~# ceph osd dump|grep pg_temp
pg_temp 4.a83 [156,83,10,7]
pg_temp 4.e8e [164,78,10,8]
***@mon1:~#

So neither the restarts nor the marking down fixed the issue; only the pg-temp trick did.

Still have two PGs left which I can test with.

Wido
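(Another hedged sketch, assuming the same output formats as above: a quick way to list the pg_temp entries that still reference OSD ids which have already been removed from the cluster, like osd.2 and osd.10 here.)

ceph osd ls > /tmp/osd_ids
ceph osd dump | awk '/^pg_temp/' | while read _ pgid osds; do
    for id in $(echo "$osds" | tr -d '[]' | tr ',' ' '); do
        # Any id not reported by 'ceph osd ls' belongs to an OSD that no longer exists.
        grep -qx "$id" /tmp/osd_ids || echo "$pgid still references removed osd.$id"
    done
done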

Sage Weil
2016-11-02 15:00:36 UTC
Permalink
On Wed, 2 Nov 2016, Wido den Hollander wrote:
> > We could add a special check during, say, upgrade, but generally the PGs
> > will re-peer as the OSDs restart anyway and that will clear it up.
> >
> > Maybe you can just confirm that marking an osd down (say, ceph osd down
> > 31) is also enough to remove the stray entry?
> >
>
> I already tried a restart of the OSDs, but that didn't work. I marked osd 31, 160 and 138 as down for PG 4.862 but that didn't work:
>
> pg_temp 4.862 [31,160,138,2]
>
> But this works:
>
> ***@mon1:~# ceph osd dump|grep pg_temp
> pg_temp 4.862 [31,160,138,2]
> pg_temp 4.a83 [156,83,10,7]
> pg_temp 4.e8e [164,78,10,8]
> ***@mon1:~# ceph osd pg-temp 4.862 31
> set 4.862 pg_temp mapping to [31]
> ***@mon1:~# ceph osd dump|grep pg_temp
> pg_temp 4.a83 [156,83,10,7]
> pg_temp 4.e8e [164,78,10,8]
> ***@mon1:~#
>
> So neither the restarts nor the marking down fixed the issue; only the pg-temp trick did.
>
> Still have two PGs left which I can test with.

Hmm. Did you leave the OSD down long enough for the PG to peer without
it? Can you confirm that doesn't work?

Thanks!
s
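(To test what Sage asks here, something along these lines should be enough; a sketch assuming PG 4.862 and osd.31 from the messages above.)

ceph osd down 31
# Watch until 4.862 has gone through peering and is active again,
ceph -w
# then check the PG state and whether the stray pg_temp entry survived.
ceph pg 4.862 query | grep -m1 '"state"'
ceph osd dump | grep 'pg_temp 4.862'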


Wido den Hollander
2016-11-02 15:13:52 UTC
Permalink
> On 2 November 2016 at 16:00, Sage Weil <***@newdream.net> wrote:
>
>
>
> Hmm. Did you leave the OSD down long enough for the PG to peer without
> it? Can you confirm that doesn't work?
>

I stopped osd.31, waited for all PGs to re-peer, waited another minute or so, and started it again, but that didn't work. The pg_temp wasn't resolved.

The whole cluster runs 0.94.9

Wido

Sage Weil
2016-11-02 15:21:08 UTC
Permalink
On Wed, 2 Nov 2016, Wido den Hollander wrote:
> > Hmm. Did you leave the OSD down long enough for the PG to peer without
> > it? Can you confirm that doesn't work?
> >
>
> I stopped osd.31, waited for all PGs to re-peer, waited another minute or so, and started it again, but that didn't work. The pg_temp wasn't resolved.
>
> The whole cluster runs 0.94.9

Hrmpf. Well, I guess that means a special case on upgrade would be
helpful. Not convinced it's the most important thing though, given this
is probably a pretty rare case and can be fixed manually. (OTOH, most
operators won't know that...)

sage
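(Until such a check exists, a hedged sketch operators can run after an upgrade to spot the symptom: it flags pg_temp entries whose PG already reports active+clean, which is exactly the inconsistency described in this thread. Same assumptions about output formats as the sketches above.)

ceph osd dump | awk '/^pg_temp/ {print $2}' | while read pgid; do
    state=$(ceph pg "$pgid" query | python -c 'import json,sys; print(json.load(sys.stdin)["state"])')
    # An active+clean PG should not need a pg_temp mapping; these entries are stale.
    [ "$state" = "active+clean" ] && echo "stale pg_temp entry for $pgid"
done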
Wido den Hollander
2016-11-02 15:25:38 UTC
Permalink
> On 2 November 2016 at 16:21, Sage Weil <***@newdream.net> wrote:
>
>
>
> Hrmpf. Well, I guess that means a special case on upgrade would be
> helpful. Not convinced it's the most important thing though, given this
> is probably a pretty rare case and can be fixed manually. (OTOH, most
> operators won't know that...)
>

Yes, I think so. It's on the ML now so search engines can find it if needed!

Fixing the PGs now manually so that the MON stores can start to trim.

Wido
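(Once the stray entries are gone the monitors can trim old osdmaps again, so the 35GB stores mentioned earlier should start shrinking. A rough way to watch that, assuming the default mon data path and a monitor id of mon1 as the prompts in this thread suggest.)

du -sh /var/lib/ceph/mon/ceph-mon1/store.db
# Optionally force a compaction once trimming has caught up.
ceph tell mon.mon1 compact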
