Discussion:
[ceph-users] Ceph strange issue after adding a cache OSD.
Daznis
2016-11-23 05:56:49 UTC
Hello,


The story goes like this.
I added another 3 drives to the caching layer. The OSDs were added to
the crush map one by one, after each successful rebalance. When I added
the last OSD and came back about an hour later, I noticed it still had
not finished rebalancing. Further investigation showed that one of the
older cache SSD OSDs was restarting constantly before it could fully
boot. So I shut it down and waited for a rebalance without that OSD.
Less than an hour later I had another 2 OSDs restarting constantly. I
tried running scrubs on the PGs the logs asked me to, but that did not
help. I'm currently stuck with "8 scrub errors" and a completely dead
cluster.

log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split)
stats; must scrub before tier agent can activate


I need help stopping the OSD from crashing. Crash log:
0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
osd/ReplicatedPG.cc: In function 'void
ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)

ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e89f]
3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x68a) [0x83c37a]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405)
[0x69af05]
7: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69b473]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
10: (()+0x7dc5) [0x7f93b9df4dc5]
11: (clone()+0x6d) [0x7f93b88d5ced]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


I have tried looking with full debug enabled, but those logs didn't
help me much. I have tried to evict the cache layer, but some objects
are stuck and can't be removed. Any suggestions would be greatly
appreciated.
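
For reference, the eviction attempt was roughly the following
("cache-pool" is just a placeholder for the cache pool name):

    # flush and evict everything from the cache tier pool
    rados -p cache-pool cache-flush-evict-all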
Nick Fisk
2016-11-23 10:04:31 UTC
Hi Daznis,

I'm not sure how much help I can be, but I will try my best.

I think the post-split stats error is probably benign, although it suggests you also increased the number of PGs in your
cache pool? If so, did you do this before or after you added the extra OSDs? This may have been the cause.

On to the actual assert: this looks like it's part of the code which trims the tiering hit sets. I don't understand why it's
crashing out, but I would imagine it must be related to an invalid or missing hitset.

https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485

The only thing I could think of from looking at the code is that the function loops through all hitsets above the max
number (hit_set_count). I wonder if setting this number higher would mean it won't try to trim any hitsets, and let things recover?
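
Something along these lines, assuming your cache pool is called
"cache-pool" (the value 32 is only an example, picked to be above however
many hitsets are currently stored):

    # check the current value first
    ceph osd pool get cache-pool hit_set_count

    # then raise it
    ceph osd pool set cache-pool hit_set_count 32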

DISCLAIMER
This is a hunch, it might not work or could possibly even make things worse. Otherwise wait for someone who has a better idea to
comment.

Nick
Daznis
2016-11-23 10:16:49 UTC
Hi,


Looks like one of my colleagues increased the PG number before it
finished. I was flushing the whole cache tier and it's currently stuck
on ~80 GB of data, because of the OSD crashes. I will look into the
hitset counts and check what can be done. Will provide an update if I
find anything or fix the issue.
Nick Fisk
2016-11-23 10:31:42 UTC
So I'm guessing that when the PGs split, the stats/hit_sets are not how the OSD expects them to be, which causes the crash.
I would expect this to have been caused by the PG splitting rather than by introducing the extra OSDs. If you manage to get
things stable by bumping up the hitset count, then you probably want to try a scrub to clean up the stats, which may then
stop this happening when the hitsets come round to being trimmed again.
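
For the PG named in the warning that would be something like the below,
repeated for any other PGs the logs flag:

    ceph pg scrub 15.8d

    # or, if a plain scrub doesn't clear the stats
    ceph pg deep-scrub 15.8d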
Daznis
2016-11-23 12:54:31 UTC
Thank you. That helped quite a lot. Now I'm just stuck with one OSD
crashing with:

osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
2016-11-23 13:42:43.278539
osd/PG.cc: 2911: FAILED assert(r > 0)

ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbde2c5]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x8ba) [0x7cf4da]
3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
4: (OSD::init()+0x181a) [0x6c0e8a]
5: (main()+0x29dd) [0x6484bd]
6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
7: /usr/bin/ceph-osd() [0x661ea9]
Nick Fisk
2016-11-23 13:08:09 UTC
Sorry, I'm afraid I'm out of ideas on that one; that error doesn't mean very much to me. The code suggests the OSD is trying
to get an attr from the disk/filesystem, but for some reason it doesn't like what it gets back. You could maybe whack the
debug logging for the OSD and filestore up to maximum and try to see which PG/file is accessed just before the crash, but
I'm not sure what the fix would be even if you manage to locate the dodgy PG.
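
Since this one asserts while the OSD is loading its PGs at startup, the
logging probably has to go into ceph.conf before restarting it; roughly
(X being the OSD id):

    [osd.X]
        debug osd = 20
        debug filestore = 20

Then start the OSD and look at the last PG/object mentioned in
/var/log/ceph/ceph-osd.X.log before the assert fires.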

Does the cluster have all PGs recovered now? Unless anyone else can comment, you might be best off removing/wiping and then re-adding the OSD.
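
You can check with something like:

    ceph -s
    ceph health detail
    ceph pg dump_stuck unclean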
Daznis
2016-11-23 13:55:31 UTC
No, it's still missing some PGs and objects and can't recover, as it's
blocked by that OSD. I can boot the OSD up by removing all the
PG-related files from the current directory, but that doesn't solve the
missing objects problem. I'm not really sure if I can move the objects
back into place manually, but I will try it.
Nick Fisk
2016-11-23 14:00:01 UTC
I take it you have size=2 or min_size=1, or something like that, for the cache pool? A single OSD shouldn't prevent PGs from recovering.

Your best bet would be to see if the PG that is causing the assert can be removed so the OSD can start up. If you are lucky, the PG causing the problems might not be one which also has unfound objects; otherwise you are likely to have to get heavily involved in recovering objects with the objectstore tool.
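
With the OSD stopped, that would be roughly the following (X is the OSD
id and 15.8d is only an example PG id; take the export first so nothing
is lost for good):

    # export the suspect PG from the stopped OSD as a backup
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X \
        --journal-path /var/lib/ceph/osd/ceph-X/journal \
        --op export --pgid 15.8d --file /root/pg-15.8d.export

    # then remove it from that OSD so the daemon can start
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X \
        --journal-path /var/lib/ceph/osd/ceph-X/journal \
        --op remove --pgid 15.8d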
Daznis
2016-11-24 15:42:51 UTC
Yes, unfortunately, it is. And the story still continues. I have
noticed that only 4 OSDs are doing this, and zapping and re-adding them
does not solve the issue. Removing them completely from the cluster
solves it, but I can't reuse their IDs. If I add another one with the
same ID, it starts doing the same "funky" crashes. For now the cluster
remains "stable" without those OSDs.
Nick Fisk
2016-11-24 16:05:29 UTC
Can you add them with different IDs? It won't look pretty, but it might get you out of this situation.
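
One way to do that, as a rough sketch assuming ceph-disk style
deployment: retire the broken OSD but deliberately skip "ceph osd rm",
so its id stays allocated in the osdmap and a freshly prepared disk gets
the next free id instead:

    # retire the broken OSD but leave its id reserved in the osdmap
    # (skipping "ceph osd rm X" is what stops the id being reused)
    ceph osd out X
    ceph osd crush remove osd.X
    ceph auth del osd.X

    # wipe and re-deploy the disk; activation creates an OSD with the
    # next free id (the device name is just an example)
    ceph-disk zap /dev/sdX
    ceph-disk prepare /dev/sdX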
Daznis
2016-11-24 19:43:57 UTC
I will try it, but I want to see if it stays stable for a few days
first. Not sure if I should report this bug or not.
Nick Fisk
2016-11-25 10:26:32 UTC
Permalink
Possibly, do you know the exact steps to reproduce? I'm guessing the PG splitting was the cause, but not knowing whether it triggers the problem on its own, or only together with new OSD's being introduced at the same time, might make tracing the cause hard.
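For anyone trying to reproduce this, the PG splitting in question is just a
pg_num/pgp_num increase on the cache pool, i.e. roughly the following (the
pool name "hot-cache" and the target of 256 PGs are only examples):

  # check the current PG count of the cache pool
  ceph osd pool get hot-cache pg_num

  # splitting: raise pg_num first, then pgp_num so the new PGs actually remap
  ceph osd pool set hot-cache pg_num 256
  ceph osd pool set hot-cache pgp_num 256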
Daznis
2016-11-25 13:59:13 UTC
Permalink
I think it's because of these errors:

2016-11-25 14:51:25.644495 7fb73eef8700 -1 log_channel(cluster) log
[ERR] : 14.28 deep-scrub stat mismatch, got 145/144 objects, 0/0
clones, 57/57 dirty, 0/0 omap, 54/53 hit_set_archive, 0/0 whiteouts,
365399477/365399252 bytes,51328/51103 hit_set_archive bytes.

2016-11-25 14:55:56.529405 7f89bae5a700 -1 log_channel(cluster) log
[ERR] : 13.dd deep-scrub stat mismatch, got 149/148 objects, 0/0
clones, 55/55 dirty, 0/0 omap, 63/61 hit_set_archive, 0/0 whiteouts,
360765725/360765503 bytes,55581/54097 hit_set_archive bytes.

I have no clue why they appeared. The cluster was running fine for
months, so I have no logs showing how it happened; I only enabled
logging after "shit hit the fan".
Nick Fisk
2016-11-25 14:20:17 UTC
Permalink
It might be worth raising a ticket with those errors, saying that you believe they occurred after PG splitting on the cache tier, and including the asserts you originally posted.
Daznis
2017-01-09 13:06:24 UTC
Permalink
Hello Nick,

Thank you for your help. We have contacted Red Hat for additional help,
and they think this bug is related to a GMT bug in Ceph version 0.94.7.
I'm not really sure how that can be, as the cluster was running
0.94.6/0.94.9. After a month or so of slowly moving data I now have the
same OS/software versions across the whole cluster, and I need to
recreate the cache layer to get rid of those missing hit set errors.
The only solution so far has been setting hit_set_count to 0 and
removing the cache layer. I will update this ticket once I'm done
recreating the cache layer and can confirm whether those errors are
gone completely.

Regards,

Darius
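For completeness, the cache-layer removal Darius describes maps onto the
standard Hammer writeback-tier teardown, roughly as sketched below. The
pool names ("hot-cache" for the tier, "cold-storage" for the base pool)
are placeholders, and the hit_set_count change is the workaround already
mentioned in this thread:

  # stop generating new hit sets on the cache pool
  ceph osd pool set hot-cache hit_set_count 0

  # put the tier into forward mode so no new objects land in it,
  # flush and evict whatever is left, then detach it from the base pool
  ceph osd tier cache-mode hot-cache forward
  rados -p hot-cache cache-flush-evict-all
  ceph osd tier remove-overlay cold-storage
  ceph osd tier remove cold-storage hot-cache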