Discussion:
[ceph-users] Upgrade to Infernalis: failed to pick suitable auth object
Kees Meijs
2018-08-17 14:53:15 UTC
Hi Cephers,

For the last few months (well... years, actually) we have been quite
happy running Hammer. Until now there was no immediate reason to upgrade.

However, with Luminous providing support for BlueStore, it seemed like a
good idea to start working towards an upgrade.

Taking baby steps, I wanted to upgrade from Hammer to Infernalis first,
since all file ownerships need to be changed because the daemons now run
as an unprivileged user (good stuff!) instead of root.

So far, I've upgraded all monitors from Hammer (0.94.10) to Infernalis
(9.2.1). All seemed well resulting in HEALTH_OK.

Then, I tried upgrading one OSD server using the following procedure:

1. Alter the APT sources to utilise Infernalis instead of Hammer.
2. Update and upgrade the packages.
3. Since I didn't want any rebalancing going on, run "ceph osd set noout".
4. Stop an OSD, chown ceph:ceph -R /var/lib/ceph/osd/ceph-X, start the
   OSD again, and so on for the next one (roughly as sketched below).
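
Per OSD that boiled down to something like this (just a sketch; OSD id 12
and the upstart-style commands are only an example, use whatever your
init system provides):

stop ceph-osd id=12
chown -R ceph:ceph /var/lib/ceph/osd/ceph-12
start ceph-osd id=12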

Maybe I acted too quickly (ehrm... didn't wait long enough), but at some
point it seemed not all ownership had been changed during the process.
Meanwhile we were still HEALTH_OK, so I didn't really worry and fixed the
left-overs using:

find /var/lib/ceph -not -user ceph -exec chown ceph:ceph '{}' ';'

It seemed to work well and two days passed without any issues. But then
the cluster reported:

     health HEALTH_ERR
            1 pgs inconsistent
            2 scrub errors
So far, I have figured out that both scrub errors apply to the same OSD:
osd.0. Its cluster log shows:
2018-08-17 15:25:36.810866 7fa3c9e09700  0 log_channel(cluster) log
[INF] : 3.72 deep-scrub starts
2018-08-17 15:25:37.221562 7fa3c7604700 -1 log_channel(cluster) log
[ERR] : 3.72 soid -5/00000072/temp_3.72_0_16187756_3476/head: failed
to pick suitable auth object
2018-08-17 15:25:37.221566 7fa3c7604700 -1 log_channel(cluster) log
[ERR] : 3.72 soid -5/00000072/temp_3.72_0_16195026_251/head: failed to
pick suitable auth object
2018-08-17 15:46:36.257994 7fa3c7604700 -1 log_channel(cluster) log
[ERR] : 3.72 deep-scrub 2 errors
The situation seems similar to http://tracker.ceph.com/issues/13862 but
so far I'm unable to repair the placement group.

Meanwhile I'm forcing a deep scrub of all placement groups that involve
osd.0, hoping PG 3.72 turns out to be the only one with errors.
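
Roughly like this (just a sketch; "ceph pg ls-by-osd" should be available
on Hammer and Infernalis, otherwise grepping "ceph pg dump" works too):

# deep-scrub every PG that has osd.0 in its acting set
ceph pg ls-by-osd 0 | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' | \
    while read pg; do ceph pg deep-scrub "$pg"; done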

While waiting for deep scrubbing to finish, it seemed like a good idea to
ask for your help.

What's the best approach at this point?
ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 3.72 is active+clean+inconsistent, acting [0,33,39]
2 scrub errors
OSDs 33 and 39 are untouched (still running 0.94.10) and seem fine
without errors.

Thanks in advance for any comments or thoughts.

Regards and enjoy your weekend!
Kees
--
https://nefos.nl/contact

Nefos IT bv
Ambachtsweg 25 (industrienummer 4217)
5627 BZ Eindhoven
Nederland

KvK 66494931

/Available on Monday, Tuesday, Wednesday and Friday/
David Turner
2018-08-17 15:21:57 UTC
In your baby step upgrade you should avoid the 2 non-LTS releases of
Infernalis and Kraken. You should go from Hammer to Jewel to Luminous.

The general rule for the upgrade that changes your OSDs to be owned by
the ceph user was to not change the ownership as part of the upgrade
itself. There is a [1] config option that tells Ceph which user the
daemons run as, so that you can separate these two operations from each
other, simplifying each maintenance task. It makes each daemon run as
whatever user owns that daemon's data directory.

[1]
setuser match path = /var/lib/ceph/$type/$cluster-$id
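
For clarity, that line just goes into ceph.conf on every host, e.g. in
the [global] section; a minimal sketch:

[global]
    setuser match path = /var/lib/ceph/$type/$cluster-$id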
Kees Meijs
2018-08-18 08:52:53 UTC
Hi David,

Thank you for pointing out the option.

On http://docs.ceph.com/docs/infernalis/release-notes/ one can read:

  * Ceph daemons now run as user and group ceph by default. The ceph
    user has a static UID assigned by Fedora and Debian (also used by
    derivative distributions like RHEL/CentOS and Ubuntu). On SUSE the
    ceph user will currently get a dynamically assigned UID when the
    user is created.

    If your systems already have a ceph user, upgrading the package will
    cause problems. We suggest you first remove or rename the existing
    ‘ceph’ user before upgrading.

    When upgrading, administrators have two options:

    1. Add the following line to ceph.conf on all hosts:

         setuser match path = /var/lib/ceph/$type/$cluster-$id

       This will make the Ceph daemons run as root (i.e., not drop
       privileges and switch to user ceph) if the daemon’s data
       directory is still owned by root. Newly deployed daemons will
       be created with data owned by user ceph and will run with
       reduced privileges, but upgraded daemons will continue to run
       as root.

    2. Fix the data ownership during the upgrade. This is the
       preferred option, but is more work. The process for each host
       would be to:

       1. Upgrade the ceph package. This creates the ceph user and
          group. For example:

            ceph-deploy install --stable infernalis HOST

       2. Stop the daemon(s):

            service ceph stop    # fedora, centos, rhel, debian
            stop ceph-all        # ubuntu

       3. Fix the ownership:

            chown -R ceph:ceph /var/lib/ceph

       4. Restart the daemon(s):

            start ceph-all               # ubuntu
            systemctl start ceph.target  # debian, centos, fedora, rhel

Since it seemed more elegant to me, I chose the second option and
followed the steps.

To be continued... Overnight, some more placement groups seem to have
become inconsistent. I'll post my findings later on.

Regards,
Kees
David Turner
2018-08-18 15:11:29 UTC
The reason to separate the items is to make one change at a time so you
know what might have caused your problems. Good luck.
Kees Meijs
2018-08-18 15:43:39 UTC
Hi again,

After listing all placement groups the problematic OSD (osd.0) is part
of, I forced a deep scrub for all of them.

A few hours later (and after some other deep scrubbing as well) the
result is:

HEALTH_ERR 8 pgs inconsistent; 14 scrub errors
pg 3.6c is active+clean+inconsistent, acting [14,2,38]
pg 3.32 is active+clean+inconsistent, acting [0,11,33]
pg 3.13 is active+clean+inconsistent, acting [8,34,9]
pg 3.30 is active+clean+inconsistent, acting [14,35,26]
pg 3.31 is active+clean+inconsistent, acting [44,35,26]
pg 3.7d is active+clean+inconsistent, acting [46,37,35]
pg 3.70 is active+clean+inconsistent, acting [0,36,11]
pg 3.72 is active+clean+inconsistent, acting [0,33,39]
14 scrub errors
OSDs (in order) 0, 8, 14 and 46 all reside on the same server; obviously
the one upgraded to Infernalis.

It would make sense that I acted too quickly on one OSD (i.e. fixed the
ownership while it was maybe still running), perhaps two, but certainly
not on all of them.

Although it's very likely it wouldn't make a difference, I'll try a ceph
pg repair for each PG.
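
Something along these lines (a sketch; it just pulls the inconsistent PG
ids out of "ceph health detail"):

# ask for a repair of every PG currently flagged inconsistent
ceph health detail | awk '/^pg .*inconsistent/ {print $2}' | \
    while read pg; do ceph pg repair "$pg"; done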

To be continued again!

Regards,
Kees
David Turner
2018-08-18 15:51:25 UTC
You can't change the file ownership while the OSDs are still running. The
instructions you pasted earlier said to stop the OSD and then run the
chown. What do the logs of the primary OSDs say about the PGs that are
inconsistent?
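
Something like this should pull out the interesting lines (assuming the
default log location; osd.0 just as an example):

grep -E '\[ERR\]|FAILED assert' /var/log/ceph/ceph-osd.0.log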
Kees Meijs
2018-08-19 06:55:45 UTC
Good morning,

The repairs didn't actually fix anything; the OSD logs show:
2018-08-18 17:45:08.927387 7fa3cbe0d700  0 log_channel(cluster) log
[INF] : 3.32 repair starts
2018-08-18 17:45:12.350343 7fa3c9608700 -1 log_channel(cluster) log
[ERR] : 3.32 soid -5/00000032/temp_3.32_0_16187756_293/head: failed to
pick suitable auth object
2018-08-18 18:07:43.908310 7fa3c9608700 -1 log_channel(cluster) log
[ERR] : 3.32 repair 1 errors, 0 fixed
2018-08-18 18:27:48.141634 7fa3c8606700  0 log_channel(cluster) log
[INF] : 3.70 repair starts
2018-08-18 18:27:49.073504 7fa3c8606700 -1 log_channel(cluster) log
[ERR] : 3.70 soid -5/00000070/temp_3.70_0_16187756_4006/head: failed
to pick suitable auth object
2018-08-18 18:51:57.393099 7fa3cae0b700 -1 log_channel(cluster) log
[ERR] : 3.70 repair 1 errors, 0 fixed
2018-08-18 19:21:20.456610 7fa3c7604700  0 log_channel(cluster) log
[INF] : 3.72 repair starts
2018-08-18 19:21:21.303999 7fa3c9e09700 -1 log_channel(cluster) log
[ERR] : 3.72 soid -5/00000072/temp_3.72_0_16187756_3476/head: failed
to pick suitable auth object
2018-08-18 19:21:21.304051 7fa3c9e09700 -1 log_channel(cluster) log
[ERR] : 3.72 soid -5/00000072/temp_3.72_0_16187756_5344/head: failed
to pick suitable auth object
2018-08-18 19:21:21.304077 7fa3c9e09700 -1 log_channel(cluster) log
[ERR] : 3.72 soid -5/00000072/temp_3.72_0_16195026_251/head: failed to
pick suitable auth object
2018-08-18 19:48:00.016879 7fa3c9e09700 -1 log_channel(cluster) log
[ERR] : 3.72 repair 3 errors, 0 fixed
2018-08-18 17:45:08.807173 7f047f9a2700  0 log_channel(cluster) log
[INF] : 3.13 repair starts
2018-08-18 17:45:10.669835 7f04821a7700 -1 log_channel(cluster) log
[ERR] : 3.13 soid -5/00000013/temp_3.13_0_16175425_287/head: failed to
pick suitable auth object
2018-08-18 18:05:28.966015 7f04795c7700  0 -- 10.128.4.3:6816/5641 >>
10.128.4.4:6800/3454 pipe(0x564161026000 sd=59 :46182 s=2 pgs=11994
cs=31 l=0 c=0x56415b4fc2c0).fault with nothing to send, going to standby
2018-08-18 18:09:46.667875 7f047f9a2700 -1 log_channel(cluster) log
[ERR] : 3.13 repair 1 errors, 0 fixed
2018-08-18 17:45:00.099722 7f1e4f857700  0 log_channel(cluster) log
[INF] : 3.6c repair starts
2018-08-18 17:45:01.982007 7f1e4f857700 -1 log_channel(cluster) log
[ERR] : 3.6c soid -5/0000006c/temp_3.6c_0_16187760_5765/head: failed
to pick suitable auth object
2018-08-18 17:45:01.982042 7f1e4f857700 -1 log_channel(cluster) log
[ERR] : 3.6c soid -5/0000006c/temp_3.6c_0_16187760_796/head: failed to
pick suitable auth object
2018-08-18 18:07:33.490940 7f1e4f857700 -1 log_channel(cluster) log
[ERR] : 3.6c repair 2 errors, 0 fixed
2018-08-18 18:29:24.339018 7f1e4d052700  0 log_channel(cluster) log
[INF] : 3.30 repair starts
2018-08-18 18:29:25.689341 7f1e4f857700 -1 log_channel(cluster) log
[ERR] : 3.30 soid -5/00000030/temp_3.30_0_16187760_3742/head: failed
to pick suitable auth object
2018-08-18 18:29:25.689346 7f1e4f857700 -1 log_channel(cluster) log
[ERR] : 3.30 soid -5/00000030/temp_3.30_0_16187760_3948/head: failed
to pick suitable auth object
2018-08-18 18:54:59.123152 7f1e4f857700 -1 log_channel(cluster) log
[ERR] : 3.30 repair 2 errors, 0 fixed
2018-08-18 18:05:27.421858 7efc52942700  0 log_channel(cluster) log
[INF] : 3.7d repair starts
2018-08-18 18:05:29.511779 7efc5013d700 -1 log_channel(cluster) log
[ERR] : 3.7d soid -5/0000007d/temp_3.7d_0_16204674_4402/head: failed
to pick suitable auth object
2018-08-18 18:29:23.159691 7efc52942700 -1 log_channel(cluster) log
[ERR] : 3.7d repair 1 errors, 0 fixed
I'll investigate further.

Regards,
Kees
Kees Meijs
2018-08-20 09:51:51 UTC
Hi again,

Overnight some other PGs turned out to be inconsistent as well after
being deep scrubbed, all with the same error:

failed to pick suitable auth object

Since there's "temp" in the object names and we're running a 3-replica
cluster, I'm thinking of just reboiling the compromised OSDs.

Any thoughts on this; can I do this safely? The cluster currently reports:

12 active+clean+inconsistent
Nota bene: file ownership cannot be the real culprit here. As I mentioned
earlier in this thread, it might have been an issue for one or maybe two
OSDs, but definitely not for all of them.

Regards,
Kees
Kees Meijs
2018-08-20 09:55:01 UTC
Ehrm, that should of course be rebuilding. (I.e. removing the OSD,
reformat, re-add.)
Post by Kees Meijs
Since there's "temp" in the object names and we're running a 3-replica
cluster, I'm thinking of just reboiling the compromised OSDs.
David Turner
2018-08-20 10:04:41 UTC
My suggestion would be to remove the OSDs and let the cluster recover
from all of the other copies. I would then deploy the node back on Hammer
instead of Infernalis. Either that, or remove these OSDs, let the cluster
backfill, and then upgrade to Jewel, then Luminous, and maybe Mimic if
you're planning on making it to the newest LTS before adding the node
back in. That way you could add them back in as BlueStore (on either
Luminous or Mimic) if that's part of your plan.
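
Per OSD that would be roughly the following (a sketch, with osd.0 just as
an example; wait for backfill to finish and the cluster to go healthy
before the final removal steps):

ceph osd out 0               # cluster starts backfilling away from it
stop ceph-osd id=0           # or your init system's equivalent
# once backfill is done and the cluster is healthy again:
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0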
Kees Meijs
2018-08-20 10:12:11 UTC
Hi David,

Thanks for your advice. My end goal is BlueStore so to upgrade to Jewel
and then Luminous would be ideal.

Currently all monitors are (successfully) running Infernalis, one OSD
node is running Infernalis and all other OSD nodes are on Hammer.

I'll try freeing up one Infernalis OSD first and see what happens. If
that goes well, I'll just (for now) give up all the OSDs on the given
node, ending up with Hammer OSDs only and Infernalis monitors.

To be continued again!

Regards,
Kees
Kees Meijs
2018-08-20 11:14:14 UTC
Bad news: I've got a PG stuck in down+peering now.

Please advise.

K.
Kees Meijs
2018-08-20 11:23:58 UTC
The given PG is back online, phew... In the OSD logs I found:
2018-08-20 13:06:33.819569 7f8962b2f700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::scan_range(int, int,
PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7f8962b2f700
time 2018-08-20 13:06:33.709922
osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)
Restarting the OSDs seems to work.

K.
Kees Meijs
2018-08-20 19:46:28 UTC
Hi again,

I'm starting to feel really unlucky here... The PG states currently look
like this:

                1387 active+clean
                  11 active+clean+inconsistent
                   7 active+recovery_wait+degraded
                   1 active+recovery_wait+undersized+degraded+remapped
                   1 active+undersized+degraded+remapped+wait_backfill
                   1 active+undersized+degraded+remapped+inconsistent+backfilling
To ensure nothing is in the way, I disabled both scrubbing and deep
scrubbing for the time being.
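
For the record, that is just:

ceph osd set noscrub
ceph osd set nodeep-scrub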

However, random OSDs (still on Hammer) keep crashing with the error
mentioned earlier (osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)).

It felt like they started crashing when hitting the PG currently
backfilling, so I set the nobackfill flag.

For now the crashing seems to have stopped. However, the cluster is slow
at the moment when trying to access the given PG via KVM/QEMU (RBD).

Recap:

* All monitors run Infernalis.
* One OSD node runs Infernalis.
* All other OSD nodes run Hammer.
* One OSD on Infernalis is set to "out" and is stopped. This OSD
seemed to contain one inconsistent PG.
* Backfilling started.
* After hours and hours of backfilling, OSDs started to crash.

Other than restarting the "out" and stopped OSD for the time being
(haven't tried that yet) I'm quite lost.

Hopefully someone has some pointers for me.

Regards,
Kees
Kees Meijs
2018-08-21 01:45:33 UTC
Hi there,

A few hours ago I started the given OSD again and gave it weight
1.00000. Backfilling started and more PGs became active+clean.

After a while the same crashing behaviour started acting up again, so I
stopped the backfilling.

Running with the noout, nobackfill, norebalance, noscrub and
nodeep-scrub flags now, but at least the cluster seems stable (fingers
crossed...).

Possible plan of attack:

1. Stopping all Infernalis OSDs.
2. Remove Ceph Infernalis packages from OSD node.
3. Install Hammer packages.
4. Start the OSDs (or maybe the package installation does this already.)

Effectively this is an OSD downgrade. Is this supported or does Ceph
"upgrade" data structures on disk as well?

Recap: this would imply going from Infernalis back to Hammer.

Any thoughts are more than welcome (maybe a completely different
approach makes sense...) Meanwhile, I'll try to catch some sleep.

Thanks, thanks!

Best regards,
Kees
David Turner
2018-08-21 14:34:54 UTC
Ceph does not support downgrading OSDs. When you removed the single OSD,
it was probably trying to move data onto the other OSDs in the node with
Infernalis OSDs. I would recommend stopping every OSD in that node and
marking them out so the cluster will rebalance without them. Assuming your
cluster is able to get healthy after that, we'll see where things are.

Also, please stop opening so many email threads about this same issue. It
makes tracking this in the archives impossible.
Kees Meijs
2018-08-21 17:08:02 UTC
Hello David,

Thank you and I'm terribly sorry; I was unaware I was starting new threads.

Off the top of my head I'd say "yes, it'll fit", but obviously I'll make
sure first.

Regards,
Kees
Paul Emmerich
2018-08-21 17:13:08 UTC
I would continue with the upgrade of all OSDs in this scenario, as it is
the old ones that are crashing, not the new ones.
Maybe with all the flags set (pause, norecover, ...).
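
I.e., something like (just a sketch):

ceph osd set pause
ceph osd set norecover
ceph osd set nobackfill
ceph osd set norebalance
# and "ceph osd unset ..." again afterwards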


Paul
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
David Turner
2018-08-21 17:48:32 UTC
The problem with the current OSDs was a poorly advised chown of the OSD
data store. From what I've pieced together, the chown was run against a
running OSD.
Kees Meijs
2018-09-10 14:43:16 UTC
Hi list,

A little update: meanwhile we added a new node consisting of Hammer OSDs
to ensure sufficient cluster capacity.

The upgraded node with Infernalis OSDs has been completely removed from
the CRUSH map and its OSDs have been removed as well (obviously we
didn't wipe the disks yet).

At the moment we're still running with the noout, nobackfill, noscrub
and nodeep-scrub flags set. Although only Hammer OSDs remain, we still
experience OSD crashes on backfilling, so we're unable to reach
HEALTH_OK.

Using debug level 20 we're (well, mostly my coworker Willem Jan is)
trying to figure out exactly why the crashes happen. Hopefully we'll get
to the bottom of it.
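
(Roughly like this, on the crashing OSDs; injectargs for a running
daemon, or the same settings in ceph.conf for the next restart. osd.12
is just an example.)

ceph tell osd.12 injectargs '--debug-osd 20 --debug-filestore 20'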

To be continued...

Regards,
Kees
Kees Meijs
2018-11-12 07:50:13 UTC
Hi list,

Between crashes we were able to let the cluster backfill as much as
possible (all monitors on Infernalis, all OSDs back on Hammer). On disk
we kept finding leftover temp objects like these:
8.0M -rw-r--r-- 1 root root 8.0M Aug 24 23:56
temp\u3.bd\u0\u16175417\u2718__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 28 05:51
temp\u3.bd\u0\u16175417\u3992__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 30 03:40
temp\u3.bd\u0\u16175417\u4521__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Aug 31 03:46
temp\u3.bd\u0\u16175417\u4817__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep  5 19:44
temp\u3.bd\u0\u16175417\u6252__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep  6 14:44
temp\u3.bd\u0\u16175417\u6593__head_000000BD__fffffffffffffffb
8.0M -rw-r--r-- 1 root root 8.0M Sep  7 10:21
temp\u3.bd\u0\u16175417\u6870__head_000000BD__fffffffffffffffb
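
(For the record, something like this lists those leftovers; the
FileStore path is just the default layout.)

find /var/lib/ceph/osd/ceph-*/current -name 'temp*' -ls
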
Restarting the given OSD didn't seem necessary; backfilling started to
work and at some point enough replicas were available for each PG.

Finally deep scrubbing repaired the inconsistent PGs automagically and
we arrived at HEALTH_OK again!

Case closed: up to Jewel.

For everyone involved: a big, big and even bigger thank you for all
pointers and support!

Regards,
Kees