Daznis
2016-11-23 05:56:49 UTC
Hello,
The story goes like this.
I added another 3 drives to the caching layer. The OSDs were added to
the CRUSH map one by one, each after the previous rebalance had
completed successfully. After adding the last OSD I went away for
about an hour, and when I came back the rebalance still had not
finished. Further investigation showed that one of the older cache SSD
OSDs was restarting in a loop before it could fully boot, so I shut it
down and waited for a rebalance without that OSD. Less than an hour
later another 2 OSDs were restarting in the same way. I tried running
scrubs on the PGs the logs asked me to scrub, but that did not help.
I'm currently stuck with "8 scrub errors" and a completely dead
cluster.
log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split)
stats; must scrub before tier agent can activate
I need help stopping the OSDs from crashing. Crash log:
0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
osd/ReplicatedPG.cc: In function 'void
ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e89f]
3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x68a) [0x83c37a]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405)
[0x69af05]
7: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69b473]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
10: (()+0x7dc5) [0x7f93b9df4dc5]
11: (clone()+0x6d) [0x7f93b88d5ced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
I have tried looking at the logs with full debug enabled, but they
didn't help me much. I have also tried to evict the cache layer, but
some objects are stuck and can't be removed. Any suggestions would be
greatly appreciated.