Smith, Eric
2018-12-04 11:25:10 UTC
We were upgrading from Ceph Hammer to Ceph Jewel (we had updated our OS from CentOS 7.1 to CentOS 7.3 prior to this without issue), and we ran into 2 issues:
1. FAILED assert(0 == "Missing map in load_pgs")
* We found that the fix described in the following mailing list thread resolved this issue (a rough command sketch follows the trace below):
https://www.mail-archive.com/search?l=ceph-***@lists.ceph.com&q=subject:%22%5C%5Bceph%5C-users%5C%5D+Bug+in+OSD+Maps%22&o=newest&f=1
2. We had 3 other OSDs that went down and are asserting with the following:
0> 2018-12-04 04:20:06.793803 7f375174b700 -1 osd/PGLog.cc: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, const std::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, pg_missing_t&, boost::optional<std::pair<eversion_t, hobject_t> >*, PGLog::LogEntryHandler*, const DoutPrefixProvider*)' thread 7f375174b700 time 2018-12-04 04:20:06.789747
osd/PGLog.cc: 391: FAILED assert(objiter->second->version > last_divergent_update)
ceph version 10.2.7.aq1 (b76d08dbcee5d59ac08004fda6976b64df3ff59b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x558c33ac9105]
2: (PGLog::_merge_object_divergent_entries(PGLog::IndexedLog const&, hobject_t const&, std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, pg_info_t const&, eversion_t, pg_missing_t&, boost::optional<std::pair<eversion_t, hobject_t> >*, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x20d4) [0x558c336b8224]
3: (PGLog::_merge_divergent_entries(PGLog::IndexedLog const&, std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, pg_info_t const&, eversion_t, pg_missing_t&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >*, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x20b) [0x558c336beb5b]
4: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xdbc) [0x558c336bc4fc]
5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0xbf) [0x558c334e9b8f]
6: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x3d1) [0x558c33515ce1]
7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x214) [0x558c335526e4]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x558c3353bc5b]
9: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x558c33502a54]
10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x558c3345b519]
11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x558c334a5c82]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x558c33aba14e]
13: (ThreadPool::WorkThread::entry()+0x10) [0x558c33abb030]
14: (()+0x7e25) [0x7f377d691e25]
15: (clone()+0x6d) [0x7f377bd1b34d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
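For issue 1, the general shape of that workaround is to fetch a known-good copy of the missing OSD map epoch from the monitors and inject it into the down OSD's store with ceph-objectstore-tool. The sketch below is only illustrative: the OSD id, epoch, and paths are placeholders, and the get-osdmap/set-osdmap ops should be confirmed against ceph-objectstore-tool --help on your 10.2.x build.

# Stop the affected OSD first (placeholder id)
systemctl stop ceph-osd@12
# Pull the map epoch named in the "Missing map" error from the monitors (placeholder epoch)
ceph osd getmap 55317 -o /tmp/osdmap.55317
# Inject it into the down OSD's store, then restart the OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --journal-path /var/lib/ceph/osd/ceph-12/journal \
    --op set-osdmap --file /tmp/osdmap.55317
systemctl start ceph-osd@12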
Everything I've found regarding this seems to indicate a hardware problem; however, these disks mount just fine, there are no errors in dmesg or /var/log/messages, and xfs_repair returns no errors.
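In concrete terms, those checks look roughly like the following (device path and mount point are placeholders; xfs_repair needs the filesystem unmounted):

dmesg -T | grep -iE 'error|fail|sector'       # kernel-level disk errors
grep -i error /var/log/messages | tail -100   # syslog
umount /var/lib/ceph/osd/ceph-12              # unmount before the read-only check
xfs_repair -n /dev/sdb1                       # -n: check only, make no changes
smartctl -a /dev/sdb                          # SMART health, as an extra sanity check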
Any idea on where to start troubleshooting this?