Discussion:
[ceph-users] Ceph strange issue after adding a cache OSD.
Daznis
2016-11-23 05:56:49 UTC
Hello,


The story goes like this.
I added another 3 drives to the caching layer. The OSDs were added to
the crush map one by one, after each successful rebalance. When I added
the last OSD and came back about an hour later, I noticed it still had
not finished rebalancing. Further investigation showed that one of the
older cache SSD OSDs was restarting constantly before it could fully
boot. So I shut it down and waited for a rebalance without that OSD.
Less than an hour later I had another 2 OSDs restarting constantly. I
tried running scrubs on the PGs the logs asked me to, but that did not
help. I'm currently stuck with "8 scrub errors" and a completely dead
cluster.

log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split)
stats; must scrub before tier agent can activate


I need help stopping the OSD from crashing. Crash log:
0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
osd/ReplicatedPG.cc: In function 'void
ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)

ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e89f]
3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x68a) [0x83c37a]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405)
[0x69af05]
7: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69b473]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
10: (()+0x7dc5) [0x7f93b9df4dc5]
11: (clone()+0x6d) [0x7f93b88d5ced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


I have tried looking with full debug enabled, but those logs didn't
help me much. I have tried to evict the cache layer, but some objects
are stuck and can't be removed. Any suggestions would be greatly
appreciated.
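
For reference, the eviction attempt was roughly the following
("cache-pool" is just a placeholder for the cache pool name):

    # flush and evict everything from the cache tier pool
    rados -p cache-pool cache-flush-evict-all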
Nick Fisk
2016-11-23 10:04:31 UTC
Hi Daznis,

I'm not sure how much help I can be, but I will try my best.

I think the post-split stats error is probably benign, although it suggests you also increased the number of PGs in your
cache pool? If so, did you do this before or after you added the extra OSDs? This may have been the cause.

On to the actual assert: this looks like it's part of the code which trims the tiering hit sets. I don't understand why it's
crashing out, but I would imagine it must be related to an invalid or missing hitset.

https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485

The only thing I could think of from looking at the code is that the function loops through all hitsets above the max
number (hit_set_count). I wonder if setting this number higher would mean it won't try to trim any hitsets, and let things recover?
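
Something along these lines, assuming your cache pool is called
"cache-pool" (the value 32 is only an example, picked to be above however
many hitsets are currently stored):

    # check the current value first
    ceph osd pool get cache-pool hit_set_count

    # then raise it
    ceph osd pool set cache-pool hit_set_count 32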

DISCLAIMER
This is a hunch, it might not work or could possibly even make things worse. Otherwise wait for someone who has a better idea to
comment.

Nick
Daznis
2016-11-23 10:16:49 UTC
Hi,


Looks like one of my colleagues increased the PG number before it
finished. I was flushing the whole cache tier and it's currently stuck
on ~80 GB of data, because of the OSD crashes. I will look into the
hitset counts and check what can be done. Will provide an update if I
find anything or fix the issue.
Nick Fisk
2016-11-23 10:31:42 UTC
So I'm guessing that when the PGs split, the stats/hit_sets are not how the OSD expects them to be, which causes the crash.
I would expect this to have been caused by the PG splitting rather than by introducing the extra OSDs. If you manage to get
things stable by bumping up the hitset count, then you probably want to try a scrub to clean up the stats, which may then
stop this happening when the hitsets come round to being trimmed again.
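
For the PG named in the warning that would be something like the below,
repeated for any other PGs the logs flag:

    ceph pg scrub 15.8d

    # or, if a plain scrub doesn't clear the stats
    ceph pg deep-scrub 15.8d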
Daznis
2016-11-23 12:54:31 UTC
Thank you. That helped quite a lot. Now I'm just stuck with one OSD
crashing with:

osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
2016-11-23 13:42:43.278539
osd/PG.cc: 2911: FAILED assert(r > 0)

ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x8ba) [0x7cf4da]
3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
4: (OSD::init()+0x181a) [0x6c0e8a]
5: (main()+0x29dd) [0x6484bd]
6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
7: /usr/bin/ceph-osd() [0x661ea9]
Nick Fisk
2016-11-23 13:08:09 UTC
Sorry, I'm afraid I'm out of ideas on that one; that error doesn't mean very much to me. The code suggests the OSD is trying
to get an attr from the disk/filesystem, but for some reason it doesn't like what it gets back. You could maybe whack the
debug logging for the OSD and filestore up to maximum and try to see which PG/file is accessed just before the crash, but
I'm not sure what the fix would be even if you manage to locate the dodgy PG.
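
Since this one asserts while the OSD is loading its PGs at startup, the
logging probably has to go into ceph.conf before restarting it; roughly
(X being the OSD id):

    [osd.X]
        debug osd = 20
        debug filestore = 20

Then start the OSD and look at the last PG/object mentioned in
/var/log/ceph/ceph-osd.X.log before the assert fires.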

Does the cluster have all PGs recovered now? Unless anyone else can comment, you might be best off removing/wiping and then re-adding the OSD.
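
You can check with something like:

    ceph -s
    ceph health detail
    ceph pg dump_stuck unclean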
Daznis
2016-11-23 13:55:31 UTC
No, it's still missing some PGs and objects and can't recover, as it's
blocked by that OSD. I can boot the OSD up by removing all the
PG-related files from the current directory, but that doesn't solve the
missing objects problem. I'm not really sure if I can move the objects
back into place manually, but I will try it.
Nick Fisk
2016-11-23 14:00:01 UTC
I take it you have size=2 or min_size=1, or something like that, for the cache pool? A single OSD shouldn't prevent PGs from recovering.

Your best bet would be to see if the PG that is causing the assert can be removed so the OSD can start up. If you are lucky, the PG causing the problems might not be one which also has unfound objects; otherwise you are likely to have to get heavily involved in recovering objects with the objectstore tool.
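
With the OSD stopped, that would be roughly the following (X is the OSD
id and 15.8d is only an example PG id; take the export first so nothing
is lost for good):

    # export the suspect PG from the stopped OSD as a backup
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X \
        --journal-path /var/lib/ceph/osd/ceph-X/journal \
        --op export --pgid 15.8d --file /root/pg-15.8d.export

    # then remove it from that OSD so the daemon can start
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X \
        --journal-path /var/lib/ceph/osd/ceph-X/journal \
        --op remove --pgid 15.8d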
Daznis
2016-11-24 15:42:51 UTC
Yes, unfortunately, it is. And the story still continues. I have
noticed that only 4 OSDs are doing this, and zapping and re-adding them
does not solve the issue. Removing them completely from the cluster
solves it, but I can't reuse their IDs. If I add another one with the
same ID, it starts doing the same "funky" crashes. For now the cluster
remains "stable" without those OSDs.
Nick Fisk
2016-11-24 16:05:29 UTC
Can you add them with different IDs? It won't look pretty, but it might get you out of this situation.
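
One way to do that, as a rough sketch assuming ceph-disk style
deployment: retire the broken OSD but deliberately skip "ceph osd rm",
so its id stays allocated in the osdmap and a freshly prepared disk gets
the next free id instead:

    # retire the broken OSD but leave its id reserved in the osdmap
    # (skipping "ceph osd rm X" is what stops the id being reused)
    ceph osd out X
    ceph osd crush remove osd.X
    ceph auth del osd.X

    # wipe and re-deploy the disk; activation creates an OSD with the
    # next free id (the device name is just an example)
    ceph-disk zap /dev/sdX
    ceph-disk prepare /dev/sdX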
Daznis
2016-11-24 19:43:57 UTC
I will try it, but I want to see if it stays stable for a few days
first. Not sure if I should report this bug or not.
Nick Fisk
2016-11-25 10:26:32 UTC
Permalink
Possibly, do you know the exact steps to reproduce? I'm guessing the PG splitting was the cause, but not knowing whether it triggers the problem on its own, or only together with new OSD's being introduced at the same time, might make tracing the cause hard.
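For anyone trying to reproduce this, the PG splitting in question is just a
pg_num/pgp_num increase on the cache pool, i.e. roughly the following (the
pool name "hot-cache" and the target of 256 PGs are only examples):

  # check the current PG count of the cache pool
  ceph osd pool get hot-cache pg_num

  # splitting: raise pg_num first, then pgp_num so the new PGs actually remap
  ceph osd pool set hot-cache pg_num 256
  ceph osd pool set hot-cache pgp_num 256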
Daznis
2016-11-25 13:59:13 UTC
Permalink
I think it's because of these errors:

2016-11-25 14:51:25.644495 7fb73eef8700 -1 log_channel(cluster) log
[ERR] : 14.28 deep-scrub stat mismatch, got 145/144 objects, 0/0
clones, 57/57 dirty, 0/0 omap, 54/53 hit_set_archive, 0/0 whiteouts,
365399477/365399252 bytes,51328/51103 hit_set_archive bytes.

2016-11-25 14:55:56.529405 7f89bae5a700 -1 log_channel(cluster) log
[ERR] : 13.dd deep-scrub stat mismatch, got 149/148 objects, 0/0
clones, 55/55 dirty, 0/0 omap, 63/61 hit_set_archive, 0/0 whiteouts,
360765725/360765503 bytes,55581/54097 hit_set_archive bytes.

I have no clue why they appeared. The cluster was running fine for
months, so I have no logs showing how it happened; I only enabled
logging after "shit hit the fan".
Nick Fisk
2016-11-25 14:20:17 UTC
Permalink
It might be worth raising a ticket with those errors, saying that you believe they occurred after PG splitting on the cache tier, and including the asserts you originally posted.
Daznis
2017-01-09 13:06:24 UTC
Permalink
Hello Nick,

Thank you for your help. We have contacted Red Hat for additional help,
and they think this bug is related to a GMT bug in Ceph version 0.94.7.
I'm not really sure how that can be, as the cluster was running
0.94.6/0.94.9. After a month or so of slowly moving data I now have the
same OS/software versions across the whole cluster, and I need to
recreate the cache layer to get rid of those missing hit set errors.
The only solution so far has been setting hit_set_count to 0 and
removing the cache layer. I will update this ticket once I'm done
recreating the cache layer and can confirm whether those errors are
gone completely.

Regards,

Darius
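For completeness, the cache-layer removal Darius describes maps onto the
standard Hammer writeback-tier teardown, roughly as sketched below. The
pool names ("hot-cache" for the tier, "cold-storage" for the base pool)
are placeholders, and the hit_set_count change is the workaround already
mentioned in this thread:

  # stop generating new hit sets on the cache pool
  ceph osd pool set hot-cache hit_set_count 0

  # put the tier into forward mode so no new objects land in it,
  # flush and evict whatever is left, then detach it from the base pool
  ceph osd tier cache-mode hot-cache forward
  rados -p hot-cache cache-flush-evict-all
  ceph osd tier remove-overlay cold-storage
  ceph osd tier remove cold-storage hot-cache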