Discussion:
[ceph-users] PG auto repair with BlueStore
Wido den Hollander
2018-08-24 06:55:20 UTC
Hi,

osd_scrub_auto_repair still defaults to false and I was wondering what we
think about enabling this feature by default.

Would we say it's safe to enable this with BlueStore?
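
For anyone who wants to try the option under discussion, here is a minimal sketch. It assumes a Mimic-or-later cluster where the centralized `ceph config set` command is available; the helper names `build_auto_repair_cmd` and `apply_auto_repair` are hypothetical, and nothing is executed unless the dry run is switched off.

```python
import subprocess
from typing import List


def build_auto_repair_cmd(enabled: bool) -> List[str]:
    """Return the argv that toggles osd_scrub_auto_repair for all OSDs.

    Assumption: the centralized config store (`ceph config set`) is in use.
    Older releases would need injectargs or a ceph.conf entry instead.
    """
    value = "true" if enabled else "false"
    return ["ceph", "config", "set", "osd", "osd_scrub_auto_repair", value]


def apply_auto_repair(enabled: bool, dry_run: bool = True) -> List[str]:
    """Build the command and, unless dry_run is set, execute it."""
    cmd = build_auto_repair_cmd(enabled)
    if not dry_run:
        # Raises CalledProcessError if the ceph CLI reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run only: print what would be executed.
    print(" ".join(apply_auto_repair(True)))
```

Whether doing so is safe on a given cluster is exactly the open question of this thread, so the sketch defaults to a dry run.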

Wido
Wido den Hollander
2018-11-15 17:40:14 UTC
Hi,

This question is actually still outstanding. Is there any good reason to
keep auto repair for scrub errors disabled with BlueStore?

I couldn't think of a reason when using size=3 and min_size=2, so just
wondering.

Thanks!

Wido
_______________________________________________
ceph-users mailing list
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Mark Schouten
2018-11-15 18:45:13 UTC
As a user, I’m very surprised that this isn’t a default setting.

Mark Schouten
koukou73gr
2018-11-15 18:51:29 UTC
Are there any means to notify the administrator that an auto-repair has
taken place?

-K.
Wido den Hollander
2018-11-16 07:26:02 UTC
Post by koukou73gr
Are there any means to notify the administrator that an auto-repair has
taken place?
I don't think so. You'll see the cluster go to HEALTH_ERR for a while
before it turns to HEALTH_OK again after the PG has been repaired.

You would have to search the cluster logs to find out that an auto repair
took place on a Placement Group.
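
Short of searching the logs, one possible way to notice a repair while it is in progress is to poll the PG state summary. The sketch below assumes the `pgmap.pgs_by_state` layout of `ceph status --format json`; the function name `pgs_in_repair` is hypothetical and the sample data is illustrative, not captured from a real cluster.

```python
import json
from typing import Dict


def pgs_in_repair(status: dict) -> Dict[str, int]:
    """Return {state_name: count} for PG states that include 'repair'."""
    states = status.get("pgmap", {}).get("pgs_by_state", [])
    return {
        s["state_name"]: s["count"]
        for s in states
        # PG states are '+'-joined flags, e.g. 'active+clean+scrubbing'.
        if "repair" in s["state_name"].split("+")
    }


# Illustrative sample only; a real caller would parse the output of
# `ceph status --format json` instead.
SAMPLE = json.loads("""
{"pgmap": {"pgs_by_state": [
  {"state_name": "active+clean", "count": 510},
  {"state_name": "active+clean+scrubbing+deep+repair", "count": 2}
]}}
""")

if __name__ == "__main__":
    for state, count in pgs_in_repair(SAMPLE).items():
        print(f"{count} PG(s) in state {state}")
```

A repair that starts and finishes between two polls would be missed, so the cluster log remains the only complete record.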

Wido
Paul Emmerich
2018-11-17 20:38:51 UTC
Post by Wido den Hollander
Post by koukou73gr
Are there any means to notify the administrator that an auto-repair has
taken place?
I don't think so. You'll see the cluster go to HEALTH_ERR for a while
before it turns to HEALTH_OK again after the PG has been repaired.
and I think even this is too much. No point in triggering a monitoring
system in the middle of the night when the scrubs are running just
because of some bit rot on a disk. Losing a few bits on disks here and
there is a perfectly normal and expected scenario that Ceph can take
care of all by itself without triggering a health *error*. It
certainly doesn't require immediate attention (with auto repair
enabled) like the error state indicates.
The message in the cluster log should be enough.
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
Matthew Vernon
2018-11-15 19:03:07 UTC
Hi,

[apropos auto-repair for scrub settings]
Post by Mark Schouten
As a user, I’m very surprised that this isn’t a default setting.
We've been too cowardly to do it so far; even on a large cluster the
occasional ceph pg repair hasn't taken up too much admin time, and the
fact it isn't enabled by default has put us off. This sometimes helps us
spot OSD drives "on the way out" that haven't actually failed yet, but
I'd be in favour of auto-repair iff we're confident it's safe (to be
fair, ceph pg repair is the first port of call anyway, so it's not clear
what we gain by having a human type it).

Regards,

Matthew
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Wido den Hollander
2018-11-16 07:25:01 UTC
Post by Mark Schouten
As a user, I’m very surprised that this isn’t a default setting.
That is because you can also have FileStore OSDs in a cluster on which
such an auto-repair is not safe.

Wido
Mark Schouten
2018-11-16 07:48:31 UTC
Which, as a user, is very surprising to me too.
--

Mark Schouten  | Tuxis Internet Engineering
KvK: 61527076  | http://www.tuxis.nl/
T: 0318 200208 | ***@tuxis.nl
Paul Emmerich
2018-11-17 20:05:57 UTC
While I also believe it to be perfectly safe on a bluestore cluster
(especially since there's osd_scrub_auto_repair_num_errors if there's
more wrong than your usual bit rot), we also don't run any cluster
with this option at the moment. We had it enabled for some time before
we backported the OOM-read-error stuff on some clusters.
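
To illustrate the threshold mentioned above: as documented, auto repair is skipped when a scrub finds more errors than osd_scrub_auto_repair_num_errors. The sketch below is an illustration of that rule, not Ceph's actual code; the function name is hypothetical and the default of 5 is an assumption taken from the documentation.

```python
def should_auto_repair(errors_found: int,
                       auto_repair_enabled: bool,
                       max_errors: int = 5) -> bool:
    """Decide whether a scrub that found errors would be auto-repaired.

    Repair happens only when the option is on, at least one error was
    found, and the error count does not exceed the threshold.
    """
    if not auto_repair_enabled:
        return False
    return 0 < errors_found <= max_errors


if __name__ == "__main__":
    # A single flipped bit is repaired; widespread damage is left for a human.
    print(should_auto_repair(1, True))
    print(should_auto_repair(40, True))
```

The point of the cap is that a handful of errors looks like ordinary bit rot, while a large count suggests something worse that should not be papered over automatically.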

But there's a small operational issue with auto repair at the moment:
this option will occasionally set the repair flag on a PG without any
scrub errors during scrubbing for some reason which triggers a health
error.

We've had a quick look at the code and couldn't figure out how the
repair flag gets set in some cases on perfectly healthy PGs. Does it
maybe only get set for a very short time while finishing up the scrub
and that's not always picked up in time?
Anyways, a potential work-around for this would be to maybe remove the
repair state from the conditions for the PG_DAMAGED warning?

Paul