[Gluster-devel] BitRot notes

Deepak Shetty dpkshetty at gmail.com
Wed Dec 10 04:50:16 UTC 2014


Thanks Venky. I also wanted to put forward how this can help in an
openstack/cloud env, where we have 2 distinct admin roles (the
virt/openstack admin and the storage admin):


1) Gluster volume 'health' should display the health status (OK, warn,
fatal/error etc.)
2) Based on that, the admin can query the 'health status' to learn which
component (AFR, quorum, geo-rep etc.) caused the health status to be
other than OK
3) Based on that component, run the right gluster cmd (scrub status, afr
status, split-brain status? etc.) to go deeper into where the problem lies

Steps 1 & 2 can be done by the virt admin, who then alerts the storage
admin, who in turn does step 3 to figure out the root cause and take the
necessary action.
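
To make the role split concrete, here's a minimal sketch (Python) of how
a mgmt layer could tie the three steps together. The health report
format and the 'scrub status' cmd are only proposals from this thread;
the heal/status cmds exist today:

    import subprocess

    # map an unhealthy component (step 2) to its drill-down cmd (step 3)
    DRILL_DOWN = {
        "bitrot": "gluster volume scrub status {vol}",  # proposed above
        "afr":    "gluster volume heal {vol} info",
        "brick":  "gluster volume status {vol}",
    }

    def handle_health(volname, status, bad_components):
        """Steps 1 & 2 (virt admin): inspect the composite health report,
        assumed to arrive as e.g. ('WARN', ['bitrot'])."""
        if status == "OK":
            return
        # step 3 (storage admin): drill into each unhealthy component
        for comp in bad_components:
            subprocess.run(DRILL_DOWN[comp].format(vol=volname).split())

    handle_health("vol0", "WARN", ["bitrot"])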

thanx,
deepak



On Tue, Dec 9, 2014 at 2:52 PM, Venky Shankar <yknev.shankar at gmail.com>
wrote:

> On Tue, Dec 9, 2014 at 1:41 PM, Deepak Shetty <dpkshetty at gmail.com> wrote:
> > We can use bitrot to provide a 'health' status for gluster volumes.
> > Hence I would like to propose (from an upstream/community perspective)
> > the notion of a 'health' status (as part of gluster volume info) which
> > can derive its value from:
> >
> > 1) Bitrot
> >     If any files are corrupted and bitrot is yet to repair them, and/or
> > it's a signal to the admin to do some manual operation to repair the
> > corrupted files (for cases where we only detect, not correct)
> >
> > 2) brick status
> >     Depending on whether bricks are offline/online
> >
> > 3) AFR status
> >     Whether we have all copies in sync or not
>
> This makes sense. Having a notion of "volume health" helps choose
> intelligently from a list of volumes.
>
> >
> > This, I believe, is along similar lines to what Ceph does today
> > (health status: OK, WARN, ERROR)
>
> Yes, Ceph derives those notions from PGs. Gluster can do it for
> replicas and/or files marked by the bitrot scrubber.
>
> > The health status derivation can be pluggable, so that in the future
> > more components can be added to the composite health status query for
> > the gluster volume.
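> >
> > A rough sketch of what I mean by pluggable (Python; the component
> > names, their inputs and the severity model here are made up purely to
> > illustrate):
> >
> >     OK, WARN, ERROR = 0, 1, 2    # severity order: ERROR > WARN > OK
> >     _checks = {}                 # registered health components
> >
> >     def health_component(name):
> >         """Decorator: plug a component into the health derivation."""
> >         def register(fn):
> >             _checks[name] = fn
> >             return fn
> >         return register
> >
> >     @health_component("bitrot")
> >     def bitrot_check(vol):
> >         # unrepaired corrupt files -> WARN (detect-only mode)
> >         return WARN if vol.get("corrupt_files", 0) else OK
> >
> >     @health_component("brick")
> >     def brick_check(vol):
> >         return ERROR if vol.get("bricks_down", 0) else OK
> >
> >     @health_component("afr")
> >     def afr_check(vol):
> >         return WARN if vol.get("pending_heals", 0) else OK
> >
> >     def volume_health(vol):
> >         """Composite health = worst status across all components."""
> >         results = {n: fn(vol) for n, fn in _checks.items()}
> >         worst = max(results.values())
> >         culprits = [n for n, s in results.items() if s == worst and worst]
> >         return ["OK", "WARN", "ERROR"][worst], culprits
> >
> >     print(volume_health({"corrupt_files": 2}))  # ('WARN', ['bitrot'])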
> >
> > In all of the above cases, as long as data can be served reliably by
> > the gluster volume, the volume status will be Started/Available, but
> > the health status can be 'degraded' or 'warn'
>
> WARN may be too strict, but something lenient enough yet descriptive
> should be chosen. Ceph does it pretty well:
> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
>
> >
> > This has many uses:
> >
> > 1) It provides an indication to the admin that something is amiss, who
> > can then check based on:
> > bitrot / scrub status
> > brick status
> > AFR status
> >
> > and take the necessary action
> >
> > 2) It helps mgmt applications (openstack, for example) make an
> > intelligent decision based on the health status (whether or not to pick
> > this gluster volume for a create-volume operation), so it acts as a
> > coarse-level filter
> >
> > 3) In general it gives the user an idea of the health of the volume
> > (which is different from the availability status, i.e. whether or not
> > the volume can serve data).
> > For example: if we have a pure DHT volume and bitrot detects silent
> > file corruption (and we are not auto-correcting), having the Gluster
> > volume status as available/started isn't entirely correct!
>
> +1
>
> >
> > thanx,
> > deepak
> >
> >
> > On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar <yknev.shankar at gmail.com>
> > wrote:
> >>
> >> > On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur <vbellur at redhat.com>
> >> > wrote:
> >> > On 11/28/2014 08:30 AM, Venky Shankar wrote:
> >> >>
> >> >> [snip]
> >> >>>
> >> >>>
> >> >>> 1. Can the bitd be one per node, like self-heal-daemon and other
> >> >>> "global" services? I worry about creating 2 * N processes for N
> >> >>> bricks in a node. Maybe we can consider having one thread per
> >> >>> volume/brick etc. in a single bitd process to make it perform
> >> >>> better.
> >> >>
> >> >>
> >> >> Absolutely.
> >> >> There would be one bitrot daemon per node, per volume.
> >> >>
> >> >
> >> > Do you foresee any problems in having one daemon per node for all
> >> > volumes?
> >>
> >> Not technically :). Probably that's a nice thing to do.
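> >>
> >> (Roughly: one bitd per node, one worker per brick -- a purely
> >> illustrative Python sketch:)
> >>
> >>     import threading
> >>
> >>     def brick_worker(volume, brick):
> >>         """Watch changes and compute checksums for one brick."""
> >>         ...
> >>
> >>     def bitd_main(node_bricks):
> >>         # one daemon per node: N threads for N bricks, not 2*N processes
> >>         for vol, brick in node_bricks:
> >>             threading.Thread(target=brick_worker,
> >>                              args=(vol, brick), daemon=True).start()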
> >>
> >> >
> >> >>
> >> >>>
> >> >>> 3. I think the algorithm for checksum computation can vary within
> >> >>> the volume. I see a reference to "Hashtype is persisted along side
> >> >>> the checksum and can be tuned per file type." Is this correct? If so:
> >> >>>
> >> >>> a) How will the policy be exposed to the user?
> >> >>
> >> >>
> >> >> Bitrot daemon would have a configuration file that can be configured
> >> >> via Gluster CLI. Tuning hash types could be based on file types or
> >> >> file name patterns (regexes) [which is a bit tricky as bitrot would
> >> >> work on GFIDs rather than filenames, but this can be solved by a
> >> >> level of indirection].
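> >> >>
> >> >> Roughly something like this (Python; the policy format and the
> >> >> GFID-to-path table below are assumptions, just for illustration):
> >> >>
> >> >>     import hashlib, re
> >> >>
> >> >>     # first matching filename pattern decides the hash type
> >> >>     POLICY = [(re.compile(r".*\.iso$"), "sha256"),
> >> >>               (re.compile(r".*\.log$"), "md5")]  # cheap for churn
> >> >>     DEFAULT_HASH = "sha1"
> >> >>
> >> >>     # the level of indirection: bitd sees GFIDs, patterns need paths
> >> >>     GFID_TO_PATH = {"d3adb33f": "/vm-images/fedora.iso"}
> >> >>
> >> >>     def hash_type_for(gfid):
> >> >>         path = GFID_TO_PATH[gfid]
> >> >>         for pattern, algo in POLICY:
> >> >>             if pattern.match(path):
> >> >>                 return algo
> >> >>         return DEFAULT_HASH
> >> >>
> >> >>     def checksum(gfid, data):
> >> >>         # hashtype is persisted alongside the checksum
> >> >>         algo = hash_type_for(gfid)
> >> >>         return algo, hashlib.new(algo, data).hexdigest()
> >> >>
> >> >>     print(checksum("d3adb33f", b"..."))  # ('sha256', '...')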
> >> >>
> >> >>>
> >> >>> b) It would be nice to have the algorithm for computing checksums be
> >> >>> pluggable. Are there any thoughts on pluggability?
> >> >>
> >> >>
> >> >> Do you mean the default hash algorithm should be configurable? If
> >> >> yes, then that's planned.
> >> >
> >> >
> >> > Sounds good.
> >> >
> >> >>
> >> >>>
> >> >>> c) What are the steps involved in changing the hashtype/algorithm
> >> >>> for a file?
> >> >>
> >> >>
> >> >> Policy changes for file {types, patterns} are lazy, i.e., they take
> >> >> effect during the next recompute. For objects that are never modified
> >> >> (after the initial checksum compute), scrubbing can recompute the
> >> >> checksum using the new hash _after_ verifying the integrity of a file
> >> >> with the old hash.
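> >> >>
> >> >> In rough Python (names invented; only to pin down the ordering --
> >> >> verify with the old hash first, migrate only on success):
> >> >>
> >> >>     import hashlib
> >> >>
> >> >>     class CorruptionDetected(Exception):
> >> >>         pass
> >> >>
> >> >>     def compute(algo, data):
> >> >>         return hashlib.new(algo, data).hexdigest()
> >> >>
> >> >>     def scrub_object(obj, new_algo):
> >> >>         # a mismatch against the *old* hash is rot, not policy change
> >> >>         if compute(obj["hashtype"], obj["data"]) != obj["checksum"]:
> >> >>             raise CorruptionDetected(obj)  # flag for repair, no rehash
> >> >>         if obj["hashtype"] != new_algo:    # lazy policy change
> >> >>             obj["checksum"] = compute(new_algo, obj["data"])
> >> >>             obj["hashtype"] = new_algo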
> >> >
> >> >
> >> >>
> >> >>>
> >> >>> 4. Is the fop on which change detection gets triggered configurable?
> >> >>
> >> >>
> >> >> As of now, all data modification fops trigger checksum calculation.
> >> >>
> >> >
> >> > Wish I had been clearer on this in my OP. Is the fop on which
> >> > checksum verification/bitrot detection happens configurable? The
> >> > feature page talks about "open" being a trigger point for this. Users
> >> > might want to trigger detection on a "read" operation and not on
> >> > open. It would be good to provide this flexibility.
> >>
> >> Ah! ok. As of now it's mostly open() and read(). Inline verification
> >> would be "off" by default, for obvious reasons.
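> >>
> >> Conceptually something like this (Python; the option names are
> >> invented, not actual volume options):
> >>
> >>     import hashlib
> >>
> >>     VERIFY_ON = {"open"}     # tunable; users could opt for {"open", "read"}
> >>     INLINE_VERIFY = False    # inline verification off by default
> >>
> >>     def maybe_verify(fop, data, stored_sum):
> >>         if INLINE_VERIFY and fop in VERIFY_ON:
> >>             return hashlib.sha256(data).hexdigest() == stored_sum
> >>         return True          # otherwise left to the scrubber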
> >>
> >> >
> >> >>
> >> >>>
> >> >>> 6. Any thoughts on integrating the bitrot repair framework with
> >> >>> self-heal?
> >> >>
> >> >>
> >> >> There are some thoughts on integration with the self-heal daemon and
> >> >> EC. I'm coming up with a doc which covers those [the reason for the
> >> >> delay in replying to your questions ;)]. Expect the doc on
> >> >> gluster-devel@ soon.
> >> >
> >> >
> >> > Will look forward to this.
> >> >
> >> >>
> >> >>>
> >> >>> 7. How does detection figure out that a lazy update is still
> >> >>> pending, and not raise a false positive?
> >> >>
> >> >>
> >> >> That's one of the things that Rachana and I discussed yesterday.
> >> >> Should scrubbing *wait* while checksum updating is still in progress,
> >> >> or is it expected that scrubbing happens when there are no active I/O
> >> >> operations on the volume? (Both imply that the bitrot daemon needs to
> >> >> know when it's done its job.)
> >> >>
> >> >> If both scrub and checksum updating go in parallel, then there needs
> >> >> to be a way to synchronize those operations. Maybe compute the
> >> >> checksum on priority, based on a hint provided by the scrub process
> >> >> (though that leaves a small window for rot)?
> >> >>
> >> >> Any thoughts?
> >> >
> >> >
> >> > Waiting for no active I/O in the volume might be a difficult
> >> > condition to reach in some deployments.
> >> >
> >> > Some form of waiting is necessary to prevent false positives. One
> >> > possibility might be to mark an object as dirty till the checksum
> >> > update is complete. Verification/scrub can then be skipped for dirty
> >> > objects.
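> >> >
> >> > Roughly (Python; the dirty flag's representation and these hooks are
> >> > placeholders for whatever the implementation ends up doing):
> >> >
> >> >     import hashlib
> >> >
> >> >     def on_modify(obj):
> >> >         obj["dirty"] = True   # set *before* acking the data fop
> >> >
> >> >     def on_checksum_updated(obj):
> >> >         obj["checksum"] = hashlib.sha256(obj["data"]).hexdigest()
> >> >         obj["dirty"] = False  # clear only after checksum is persisted
> >> >
> >> >     def scrub(obj):
> >> >         if obj.get("dirty"):
> >> >             return "SKIP"     # update pending -> no false positive
> >> >         ok = hashlib.sha256(obj["data"]).hexdigest() == obj["checksum"]
> >> >         return "OK" if ok else "CORRUPT"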
> >>
> >> Makes sense. Thanks!
> >>
> >> >
> >> > -Vijay
> >> >
> >
> >
>