[Gluster-devel] BitRot notes

Deepak Shetty dpkshetty at gmail.com
Tue Dec 9 08:11:45 UTC 2014


We can use bitrot to provide a 'health' status for gluster volumes.
Hence I would like to propose (from an upstream/community perspective) the
notion of a 'health' status (as part of gluster volume info) which can
derive its value from:

1) Bitrot status
    Whether any files are corrupted that bitrot has yet to repair, and/or
whether the admin needs to perform some manual operation to repair the
corrupted files (for cases where we only detect, not correct)

2) Brick status
    Depending on whether bricks are offline or online

3) AFR status
    Whether all copies are in sync or not

This, I believe, is along similar lines to what Ceph does today (health
status: OK, WARN, ERROR).
The health status derivation can be pluggable, so that in the future more
components can be added to the query for the composite health status of the
gluster volume.

In all of the above cases, as long as data can be served reliably by the
gluster volume, the gluster volume status will be Started/Available, but the
health status can be 'degraded' or 'warn'.
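
To make the pluggable part concrete, here is a rough sketch (Python-style;
every name and check here is hypothetical, not existing Gluster code) of how
per-component checks could be combined into a composite health value:

    # Minimal sketch only: component checks and names are hypothetical.
    OK, WARN, ERROR = 0, 1, 2

    def bitrot_health(volume):
        # e.g. WARN/ERROR if scrubbing has found corrupted files that are
        # still pending repair or manual admin action
        return OK

    def brick_health(volume):
        # e.g. WARN/ERROR depending on how many bricks are offline
        return OK

    def afr_health(volume):
        # e.g. WARN if not all replica copies are in sync
        return OK

    # Pluggable: a new component just registers another callable here.
    HEALTH_CHECKS = [bitrot_health, brick_health, afr_health]

    def volume_health(volume):
        # Composite health is the worst status reported by any component.
        return max(check(volume) for check in HEALTH_CHECKS)

A future component would only need to register one more check in the list;
the existing Started/Available status stays untouched and only the composite
health value degrades.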

This has many uses:

1) It provides an indication to the admin that something is amiss, so the
admin can check:
    bitrot / scrub status
    brick status
    AFR status
and take the necessary action.

2) It helps management applications (OpenStack, for example) make an
intelligent decision based on the health status (whether or not to pick this
gluster volume for a create-volume operation), so it acts as a coarse-level
filter.

3) In general, it gives the user an idea of the health of the volume (which
is different from the availability status, i.e. whether or not the volume
can serve data).
For example: if we have a pure DHT volume and bitrot detects silent file
corruption (and we are not auto-correcting), having the gluster volume
status as available/started isn't entirely correct!

thanx,
deepak


On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar <yknev.shankar at gmail.com>
wrote:

> On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur <vbellur at redhat.com> wrote:
> > On 11/28/2014 08:30 AM, Venky Shankar wrote:
> >>
> >> [snip]
> >>>
> >>>
> >>> 1. Can the bitd be one per node like self-heal-daemon and other
> >>> "global" services? I worry about creating 2 * N processes for N bricks
> >>> in a node. Maybe we can consider having one thread per volume/brick
> >>> etc. in a single bitd process to make it perform better.
> >>
> >>
> >> Absolutely.
> >> There would be one bitrot daemon per node, per volume.
> >>
> >
> > Do you foresee any problems in having one daemon per node for all
> > volumes?
>
> Not technically :). Probably that's a nice thing to do.
>
> >
> >>
> >>>
> >>> 3. I think the algorithm for checksum computation can vary within the
> >>> volume. I see a reference to "Hashtype is persisted along side the
> >>> checksum and can be tuned per file type." Is this correct? If so:
> >>>
> >>> a) How will the policy be exposed to the user?
> >>
> >>
> >> Bitrot daemon would have a configuration file that can be configured
> >> via Gluster CLI. Tuning hash types could be based on file types or
> >> file name patterns (regexes) [which is a bit tricky as bitrot would
> >> work on GFIDs rather than filenames, but this can be solved by a level
> >> of indirection].
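
As a rough illustration of per-pattern hash tuning with the GFID-to-path
indirection mentioned above (the policy format, resolver and defaults are
assumptions, not the actual bitd design):

    # Sketch only: policy format and resolver are hypothetical.
    import hashlib
    import re

    # pattern -> hash type, e.g. loaded from the bitd config file
    HASH_POLICY = [(re.compile(r"\.iso$"), "sha512"),
                   (re.compile(r"\.log$"), "md5")]
    DEFAULT_HASH = "sha256"

    def hash_type_for(gfid, resolve_path):
        # bitrot works on GFIDs, so resolve GFID -> path first
        # (the level of indirection mentioned above).
        path = resolve_path(gfid)
        for pattern, hashtype in HASH_POLICY:
            if pattern.search(path):
                return hashtype
        return DEFAULT_HASH

    def compute_checksum(data, hashtype):
        return hashlib.new(hashtype, data).hexdigest()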
> >>
> >>>
> >>> b) It would be nice to have the algorithm for computing checksums be
> >>> pluggable. Are there any thoughts on pluggability?
> >>
> >>
> >> Do you mean the default hash algorithm should be configurable? If yes,
> >> then that's planned.
> >
> >
> > Sounds good.
> >
> >>
> >>>
> >>> c) What are the steps involved in changing the hashtype/algorithm for a
> >>> file?
> >>
> >>
> >> Policy changes for file {types, patterns} are lazy, i.e., they take
> >> effect during the next recompute. For objects that are never modified
> >> (after initial checksum compute), scrubbing can recompute the checksum
> >> using the new hash _after_ verifying the integrity of a file with the
> >> old hash.
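
In pseudocode, that lazy re-hash during scrubbing could look roughly like
this (the object handle and its methods are hypothetical):

    # Sketch only: 'obj' is a hypothetical object handle.
    import hashlib

    def scrub_object(obj, new_hashtype):
        # Hashtype is persisted alongside the checksum, so verify with
        # the *old* hash first.
        old_hashtype, stored_sum = obj.persisted_checksum()
        data = obj.data()
        if hashlib.new(old_hashtype, data).hexdigest() != stored_sum:
            return "corrupted"
        # Policy changed lazily: recompute and persist with the new hash.
        if old_hashtype != new_hashtype:
            obj.persist_checksum(new_hashtype,
                                 hashlib.new(new_hashtype, data).hexdigest())
        return "ok"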
> >
> >
> >>
> >>>
> >>> 4. Is the fop on which change detection gets triggered configurable?
> >>
> >>
> >> As of now all data modification fops trigger checksum calculation.
> >>
> >
> > Wish I was more clear on this in my OP. Is the fop on which checksum
> > verification/bitrot detection happens configurable? The feature page
> > talks about "open" being a trigger point for this. Users might want to
> > trigger detection on a "read" operation and not on open. It would be
> > good to provide this flexibility.
>
> Ah! OK. As of now it's mostly open() and read(). Inline verification
> would be "off" by default, for obvious reasons.
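
A tiny sketch of how that trigger could be exposed as tunables (the option
names and verification hook are made up, not existing volume options):

    # Sketch only: option names and the verify callback are hypothetical.
    OPTIONS = {
        "bitrot.inline-verify": False,      # off by default (it is costly)
        "bitrot.verify-on": {"open"},       # could be {"read"} or both
    }

    def maybe_verify(fop, obj, verify):
        # 'verify' is whatever routine compares data against the stored
        # checksum; it runs only for the configured trigger fops.
        if OPTIONS["bitrot.inline-verify"] and fop in OPTIONS["bitrot.verify-on"]:
            return verify(obj)
        return None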
>
> >
> >>
> >>>
> >>> 6. Any thoughts on integrating the bitrot repair framework with
> >>> self-heal?
> >>
> >>
> >> There are some thoughts on integration with self-heal daemon and EC.
> >> I'm coming up with a doc which covers those [reason for delay in
> >> replying to your questions ;)]. Expect the doc in gluster-devel@
> >> soon.
> >
> >
> > Will look forward to this.
> >
> >>
> >>>
> >>> 7. How does detection figure out that lazy updation is still pending
> >>> and not raise a false positive?
> >>
> >>
> >> That's one of the things that Rachana and I discussed yesterday.
> >> Should scrubbing *wait* while checksum updating is still in progress,
> >> or is it expected that scrubbing happens when there are no active I/O
> >> operations on the volume (both of which imply that the bitrot daemon
> >> needs to know when it's done its job)?
> >>
> >> If both scrub and checksum updating go in parallel, then there needs
> >> to be a way to synchronize those operations. Maybe compute the checksum
> >> on priority, based on a hint provided by the scrub process (that leaves
> >> little window for rot though)?
> >>
> >> Any thoughts?
> >
> >
> > Waiting for no active I/O in the volume might be a difficult condition to
> > reach in some deployments.
> >
> > Some form of waiting is necessary to prevent false positives. One
> > possibility might be to mark an object as dirty till checksum updation is
> > complete. Verification/scrub can then be skipped for dirty objects.
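
A rough sketch of that dirty-marking scheme (the xattr names and object
helpers are hypothetical, not the actual on-disk format):

    # Sketch only: xattr names and 'obj' methods are hypothetical.
    import hashlib

    DIRTY_XATTR = "trusted.bitrot.dirty"
    SIGN_XATTR = "trusted.bitrot.signature"

    def on_write(obj):
        # Data changed: the lazy checksum update is now pending.
        obj.set_xattr(DIRTY_XATTR, b"1")

    def lazy_checksum_update(obj, hashtype):
        # Bitrot daemon recomputes the checksum, then clears the dirty flag.
        obj.set_xattr(SIGN_XATTR,
                      hashlib.new(hashtype, obj.data()).hexdigest())
        obj.remove_xattr(DIRTY_XATTR)

    def scrub(obj, hashtype):
        if obj.has_xattr(DIRTY_XATTR):
            return "skipped"        # update pending, avoid a false positive
        stored = obj.get_xattr(SIGN_XATTR)
        actual = hashlib.new(hashtype, obj.data()).hexdigest()
        return "ok" if actual == stored else "corrupted"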
>
> Makes sense. Thanks!
>
> >
> > -Vijay
> >