[ceph-users] OSD won't start after moving to a new node with ceph 12.2.10
Cassiano Pilipavicius
2018-11-27 22:04:18 UTC
Hi, I am facing a problem where an OSD won't start after being moved to a new
node running 12.2.10 (the old node has 12.2.8).

One node of my cluster failed and I tried to move its 3 OSDs to a new node.
2 of the 3 OSDs started and are running fine at the moment (backfilling is
still in progress), but one of the OSDs just won't start and logs the error
below (I am writing mostly to find out whether this is a bug or whether I
have done something wrong):

2018-11-27 19:44:38.013454 7fba0d35fd80 -1
bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0xb1a184d1, expected 0xb682fc52, device
location [0x10000~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#
2018-11-27 19:44:38.013501 7fba0d35fd80 -1 osd.1 0 OSD::init() : unable
to read osd superblock
2018-11-27 19:44:38.013511 7fba0d35fd80  1
bluestore(/var/lib/ceph/osd/ceph-1) umount
2018-11-27 19:44:38.065478 7fba0d35fd80  1 stupidalloc 0x0x55ebb04c3f80
shutdown
2018-11-27 19:44:38.077261 7fba0d35fd80  1 freelist shutdown
2018-11-27 19:44:38.077316 7fba0d35fd80  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:217]
Shutdown: canceling all background work
2018-11-27 19:44:38.077982 7fba0d35fd80  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:343]
Shutdown complete
2018-11-27 19:44:38.107923 7fba0d35fd80  1 bluefs umount
2018-11-27 19:44:38.108248 7fba0d35fd80  1 stupidalloc 0x0x55ebb01cddc0
shutdown
2018-11-27 19:44:38.108302 7fba0d35fd80  1 bdev(0x55ebb01cf800
/var/lib/ceph/osd/ceph-1/block) close
2018-11-27 19:44:38.362984 7fba0d35fd80  1 bdev(0x55ebb01cf600
/var/lib/ceph/osd/ceph-1/block) close
2018-11-27 19:44:38.470791 7fba0d35fd80 -1  ** ERROR: osd init failed:
(22) Invalid argument
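
For reference, I was planning to double-check the store and the disk itself
with something like the commands below. This is just a rough sketch; I am
assuming the OSD daemon is stopped, the BlueStore OSD is mounted at
/var/lib/ceph/osd/ceph-1, and /dev/sdX stands in for its actual data device:

    # read-only consistency check of the BlueStore OSD (daemon must be stopped)
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1

    # SMART health of the underlying disk, to rule out a failing device
    smartctl -a /dev/sdX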

My cluster has too many mixed versions; I hadn't realized that the version
changes when running a yum update, and right now I have the following
situation (output of "ceph versions"):
{
    "mon": {
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
luminous (stable)": 1,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
luminous (stable)": 2
    },
    "mgr": {
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
luminous (stable)": 1
    },
    "osd": {
        "ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
luminous (stable)": 18,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
luminous (stable)": 27,
        "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
luminous (stable)": 1
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
luminous (stable)": 20,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
luminous (stable)": 29,
        "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
luminous (stable)": 1
    }
}

Is there an easy way to get the OSD working again? I am thinking about
waiting for the backfill/recovery to finish, then upgrading all nodes to
12.2.10, and if the OSD still doesn't come up, recreating it.
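
If recreating turns out to be the only option, my rough plan is along these
lines (just a sketch; osd.1 and /dev/sdX are placeholders for the broken OSD
and its device, and I am assuming a ceph-volume/BlueStore deployment):

    # stop the daemon and mark the OSD out so its data gets re-replicated
    systemctl stop ceph-osd@1
    ceph osd out 1

    # once recovery has finished and the cluster is healthy again, remove the
    # OSD completely (purge drops it from the CRUSH map, deletes its auth key
    # and removes it from the OSD map)
    ceph osd purge 1 --yes-i-really-mean-it

    # wipe the old device and create a fresh OSD on it
    ceph-volume lvm zap /dev/sdX
    ceph-volume lvm create --bluestore --data /dev/sdX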

Regards,
Cassiano Pilipavicius.
Paul Emmerich
2018-11-27 22:16:42 UTC
This is *probably* unrelated to the upgrade, as it's complaining about data
corruption at a very early stage (earlier than the point at which the known
12.2.9 bug would trigger), so this might just be a coincidence with a bad
disk.

That being said: you are running a 12.2.9 OSD, and you probably should not
upgrade to 12.2.10, especially while a backfill is running.
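
Before upgrading anything I would wait for the backfill to drain and
double-check which version each daemon is actually running, roughly like
this (just a sketch):

    # watch recovery/backfill progress; wait until no PGs are backfilling
    ceph -s
    ceph pg stat

    # confirm what the running daemons report, since the packages installed
    # on disk can differ from the version a daemon was started with
    ceph versions
    ceph tell osd.* version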

Paul
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
