Cassiano Pilipavicius
2018-11-27 22:04:18 UTC
Hi, I am facing a problem where an OSD won't start after being moved to a new
node running 12.2.10 (the old node had 12.2.8).
One node of my cluster failed and I tried to move its 3 OSDs to a new
node. 2 of the 3 OSDs have started and are running fine at the moment
(backfilling is still in progress), but one of the OSDs just doesn't start,
with the following error in the logs (writing mostly to find out whether this
is a bug or whether I have done something wrong):
2018-11-27 19:44:38.013454 7fba0d35fd80 -1
bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0xb1a184d1, expected 0xb682fc52, device
location [0x10000~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#
2018-11-27 19:44:38.013501 7fba0d35fd80 -1 osd.1 0 OSD::init() : unable
to read osd superblock
2018-11-27 19:44:38.013511 7fba0d35fd80 1
bluestore(/var/lib/ceph/osd/ceph-1) umount
2018-11-27 19:44:38.065478 7fba0d35fd80 1 stupidalloc 0x0x55ebb04c3f80
shutdown
2018-11-27 19:44:38.077261 7fba0d35fd80 1 freelist shutdown
2018-11-27 19:44:38.077316 7fba0d35fd80 4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:217]
Shutdown: canceling all background work
2018-11-27 19:44:38.077982 7fba0d35fd80 4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:343]
Shutdown complete
2018-11-27 19:44:38.107923 7fba0d35fd80 1 bluefs umount
2018-11-27 19:44:38.108248 7fba0d35fd80 1 stupidalloc 0x0x55ebb01cddc0
shutdown
2018-11-27 19:44:38.108302 7fba0d35fd80 1 bdev(0x55ebb01cf800
/var/lib/ceph/osd/ceph-1/block) close
2018-11-27 19:44:38.362984 7fba0d35fd80 1 bdev(0x55ebb01cf600
/var/lib/ceph/osd/ceph-1/block) close
2018-11-27 19:44:38.470791 7fba0d35fd80 -1 ** ERROR: osd init failed:
(22) Invalid argument
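In case it is useful, I assume I could also try a consistency check/repair of
that OSD's store with ceph-bluestore-tool while the daemon is down (it won't
start anyway); the OSD id and data path below are just the ones from my node:

   ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1
   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-1

I do not know whether repair can actually fix a bad osd_superblock checksum,
so any advice on that is welcome.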
My cluster has too many mixed versions; I hadn't realized that the versions
change when running a yum update, and right now I have the following
situation (output of ceph versions):
{
    "mon": {
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 1,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 2
    },
    "mgr": {
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 1
    },
    "osd": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 18,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
        "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 1
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 20,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 29,
        "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 1
    }
}
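To keep the packages from drifting again on the next yum update, I assume I
could pin the Ceph packages on each node until I am ready to upgrade, e.g.
with yum's exclude directive (this is just the standard yum config syntax,
not something I have in place yet):

   # in /etc/yum.conf: skip ceph packages during routine updates
   exclude=ceph* librados* librbd* libcephfs*

and remove the exclude line only when doing the controlled upgrade to 12.2.10.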
Is there an easy way to get the OSD working again? I am thinking about
waiting for the backfill/recovery to finish, then upgrading all nodes to
12.2.10, and if the OSD still doesn't come up, recreating it.
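If recreating it ends up being the way to go, I assume the usual
removal/re-creation sequence applies (osd.1 and /dev/sdX below are just
examples from my setup):

   ceph osd out 1                           # make sure no data is mapped to it
   ceph osd purge 1 --yes-i-really-mean-it  # remove it from crush, auth and the osd map
   ceph-volume lvm zap /dev/sdX             # wipe the old device on the new node
   ceph-volume lvm create --bluestore --data /dev/sdX

but I would rather avoid destroying the device if there is any chance of
recovering the superblock.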
Regards,
Cassiano Pilipavicius.