[Gluster-devel] bit rot support for glusterfs design draft v0.1

Paul Cuzner pcuzner at redhat.com
Tue Jan 28 20:51:52 UTC 2014


----- Original Message -----

> From: "shishir gowda" <gowda.shishir at gmail.com>
> To: "Paul Cuzner" <pcuzner at redhat.com>
> Cc: gluster-devel at nongnu.org
> Sent: Tuesday, 28 January, 2014 6:13:13 PM
> Subject: Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1

> On 28 January 2014 03:48, Paul Cuzner <pcuzner at redhat.com> wrote:
> >
> >
> > ________________________________
> >
> > From: "shishir gowda" <gowda.shishir at gmail.com>
> > To: gluster-devel at nongnu.org
> > Sent: Monday, 27 January, 2014 6:30:13 PM
> > Subject: [Gluster-devel] bit rot support for glusterfs design draft v0.1
> >
> >
> > Hi All,
> >
> > Please find the updated bit-rot design for glusterfs volumes.
> >
> > Thanks to Vijay Bellur for his valuable inputs in the design.
> >
> > Phase 1: File-level bit rot detection
> >
> > The initial approach is to achieve bit rot detection at the file
> > level, where a checksum is computed for the complete file and checked
> > during access.
> >
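A minimal sketch of that whole-file checksum compute, in C. The draft does not name an algorithm, so SHA-256 (via OpenSSL) is an assumption here:

/* Sketch only: whole-file checksum, as it might be computed on close().
 * SHA-256 is an assumed algorithm; the draft does not fix one. */
#include <openssl/sha.h>
#include <stdio.h>

static int file_checksum(const char *path,
                         unsigned char out[SHA256_DIGEST_LENGTH])
{
    unsigned char buf[64 * 1024];
    SHA256_CTX ctx;
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (!fp)
        return -1;
    SHA256_Init(&ctx);
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        SHA256_Update(&ctx, buf, n);
    SHA256_Final(out, &ctx);
    fclose(fp);
    return 0;
}

Note that the whole file has to be read each time, which is exactly the cost phase 2 aims to remove.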
> > A single daemon (say BitD) per node will be responsible for all the
> > bricks of the node. This daemon will be registered with the gluster
> > management daemon, and any graph changes
> > (add-brick/remove-brick/replace-brick/stop bit-rot) will be handled
> > accordingly. This BitD will register with the changelog xlator of all
> > the bricks on the node, and process changes from them.
> >
> >
> > Doesn't having a single daemon for all bricks, instead of a per-brick
> > 'bitd', introduce a potential performance bottleneck?
> >
> >
> Most of the current gluster-related daemons work in this mode
> (nfs/selfheal/quota). Additionally, if we introduce a 1:1 mapping
> between a brick and bitd, then managing these daemons would bring in
> its own overheads.
OK - but bitd is in the "I/O path", isn't it, and it's compute intensive - which is why I'm concerned about scale and the potential impact on latency.

If NFS access is going to be problematic, why not exclude it from interactive checksums and resort to batch mode, so the admin can keep performance up and schedule a scrub window when applications/users are not affected?

> > The changelog xlator would give the list of files (in terms of gfid)
> > that have changed during a defined interval. Checksums would have to
> > be computed for these, based on either the fd close() call for
> > non-NFS access, or every write for anonymous fd access (NFS). The
> > computed checksum, in addition to the timestamp of the computation,
> > would be saved as an extended attribute (xattr) of the file. By using
> > the changelog xlator, we would avoid periodic scans of the bricks to
> > identify the files whose checksums need to be updated.
> >
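As a rough illustration of that xattr persistence step, a sketch in C; the "trusted.bit-rot.*" xattr names are hypothetical, as the draft does not define them:

/* Sketch only: persist the checksum and its compute timestamp as xattrs.
 * Names are placeholders; time_t is stored raw for brevity. */
#include <sys/xattr.h>
#include <time.h>

static int store_checksum(int fd, const unsigned char *sum, size_t sum_len)
{
    time_t now = time(NULL);

    if (fsetxattr(fd, "trusted.bit-rot.checksum", sum, sum_len, 0) < 0)
        return -1;
    return fsetxattr(fd, "trusted.bit-rot.timestamp", &now, sizeof(now), 0);
}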
With the checksum update keyed to close() - what happens in environments like oVirt or OpenStack, where image files can stay open indefinitely?

It would be great to understand which gluster use cases the bit-rot plan addresses, and by consequence which use cases, if any, would be problematic/impractical.

> >
> > Using the changelog is a great idea, but I'd also see a requirement for an
> > admin initiated full scan at least when bringing existing volumes under
> > bitd
> > control.
> >

> Sorry, failed to mention it. Once bitrot is turned on, a full scan of
> each brick is started.

> > Also, what's the flow if the xattr is unreadable due to bit rot? In
> > btrfs, metadata is typically mirrored.
> >

> Currently, if the xattr is unreadable, we would treat it as a failure
> from the brick end. If the volume is replicated, then another brick
> might be able to serve the file.

> >
> > Upon access (open for non-anonymous-fd calls, every read for
> > anonymous-fd calls) from any client, the bit rot detection xlator
> > loaded on top of the bricks would recompute the checksum of the file,
> > allow the call to proceed if it matches the stored one, and fail it
> > on a mismatch. This introduces extra work for NFS workloads, and for
> > large files, which require a read of the complete file to recompute
> > the checksum (we try to solve this in phase 2).
> >
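A sketch of that verify-on-access decision, reusing file_checksum() and the hypothetical xattr name from the sketches above:

/* Sketch only: brick-side verification on access. */
#include <errno.h>
#include <string.h>
#include <sys/xattr.h>

static int verify_on_access(const char *path)
{
    unsigned char stored[SHA256_DIGEST_LENGTH];
    unsigned char fresh[SHA256_DIGEST_LENGTH];
    ssize_t n = getxattr(path, "trusted.bit-rot.checksum",
                         stored, sizeof(stored));

    if (n != sizeof(stored))
        return -1;        /* unreadable xattr: treated as a brick failure */
    if (file_checksum(path, fresh) < 0)
        return -1;
    if (memcmp(stored, fresh, sizeof(fresh)) != 0)
        return -EIO;      /* mismatch: fail the call */
    return 0;             /* match: let the call proceed */
}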
> > Every read..? That sounds like such an overhead, admins would just
> > turn it off.
> >

> NFS does not send open calls; it sends read calls directly on
> anonymous fds. On such occasions, for anonymous-fd reads, we would
> have to compute the checksum on every read. This is one of the reasons
> why, in phase 2, we want block-level checksums, to avoid reading the
> complete file on every read.

> > I assume failing a read due to checksum inconsistency in a replicated
> > volume would trigger one of the other replicas to be used, so the
> > issue is transparent to the end user/application.
> >
> >
> That is the expected behaviour.

> >
> > Since a data write happens first, followed by a delayed checksum
> > compute, there is a time frame where we might have data updated but
> > checksums yet to be computed. We should allow access to such files if
> > the file's timestamp (mtime) has changed and is within a defined
> > range of the current time.
> >
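That mtime grace window might look like the following sketch, where GRACE_SECS stands in for the "defined range" the draft leaves open:

/* Sketch only: tolerate a stale checksum for a recently written file.
 * GRACE_SECS is a hypothetical tunable. */
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

#define GRACE_SECS 120

static bool within_grace_window(const struct stat *st, time_t checksum_ts)
{
    time_t now = time(NULL);

    /* Data changed after the last checksum compute, but recently enough
     * that the delayed compute can be assumed not to have run yet. */
    return st->st_mtime > checksum_ts && (now - st->st_mtime) <= GRACE_SECS;
}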
> > Additionally, we could/should have the ability to switch off checksum
> > computation from glusterfs's perspective, if the underlying FS
> > exposes/implements bit-rot detection (btrfs).
> >
> > +1 Why re-invent the wheel!
> >
> >
> > Phase 2: Block-level (user-space/defined) bit rot detection and correction.
> >
> > The eventual aim is to be able to heal/correct bit rot in files. To
> > achieve this, checksums would be computed at a finer granularity,
> > such as a block (with size limited by the bit-rot algorithm), so that
> > we not only detect bit rot but also have the ability to restore the
> > data. Additionally, for large files, checking checksums at the block
> > level is more efficient than recomputing the checksum of the whole
> > file on every access.
> >
> >
> > In this phase, we could move the checksum computation to the xlator
> > loaded on top of the posix translator at each brick. With every
> > write, we could compute the checksum, store it, and continue with the
> > write (or vice versa). Every access would also read/compute the
> > checksum of the requested block, check it against the saved checksum
> > of the block, and act accordingly. This would remove the dependency
> > on the external BitD and the changelog xlator.
> >
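A sketch of what per-block checksumming on the write path could look like; BLOCK_SIZE and the table layout are illustrative assumptions:

/* Sketch only: one digest per fixed-size block, refreshed on write. */
#include <openssl/sha.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE (128 * 1024)

struct block_sums {
    uint64_t nblocks;
    unsigned char sum[][SHA256_DIGEST_LENGTH]; /* one digest per block */
};

static void update_block_sum(struct block_sums *bs, uint64_t block_no,
                             const unsigned char *data, size_t len)
{
    /* Only the block touched by this write is recomputed; a read likewise
     * verifies just the blocks it covers, never the whole file. */
    SHA256(data, len, bs->sum[block_no]);
}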
> > Additionally, using an error-correcting code (ECC) or
> > forward-error-correction (FEC) algorithm would enable us to correct
> > the few bits in a block which have gone corrupt. And computing the
> > complete file's checksum is eliminated, as we are dealing with blocks
> > of defined size.
> >
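To make the ECC idea concrete, a toy single-bit corrector using a Hamming(7,4) code; a real implementation would use a stronger block code (e.g. Reed-Solomon), so this is illustrative only:

/* Sketch only: Hamming(7,4) - encode 4 data bits with 3 parity bits,
 * then locate and flip a single corrupted bit via the syndrome. */
#include <stdint.h>

static uint8_t hamming74_encode(uint8_t d)   /* 4 LSBs hold d1..d4 */
{
    uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;
    uint8_t p2 = d1 ^ d3 ^ d4;
    uint8_t p4 = d2 ^ d3 ^ d4;

    /* Codeword bit i holds position i+1: p1 p2 d1 p4 d2 d3 d4 */
    return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
           (d2 << 4) | (d3 << 5) | (d4 << 6);
}

/* Returns the corrected error position (0 if the codeword was clean). */
static int hamming74_correct(uint8_t *cw)
{
    int syndrome = 0, pos;

    for (pos = 1; pos <= 7; pos++)
        if ((*cw >> (pos - 1)) & 1)
            syndrome ^= pos;              /* XOR of positions of set bits */
    if (syndrome)
        *cw ^= 1 << (syndrome - 1);       /* flip the corrupt bit back */
    return syndrome;
}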
> > We require the ability to store these fine-grained checksums
> > efficiently, and extended attributes would not scale for this
> > implementation. Either a custom backing store or a DB would be
> > preferable in this instance.
> >
> > so if there is a per-block checksum, won't our capacity overheads
> > increase to store the extra metadata, on top of our existing
> > replication/raid overhead?
> >

> That is true. But we need to address bit rot at the brick level.
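For a sense of the scale (with illustrative numbers: 32-byte SHA-256 digests over 128 KiB blocks), a 1 TiB brick holds 2^23 = 8,388,608 blocks, so the digests alone come to 256 MiB - about 0.02% of capacity. That is small next to replication overhead, but it is exactly the volume of metadata that stops fitting comfortably in per-file xattrs.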

> > Where does Xavi's disperse volume fit into this? Would an Erasure
> > Coded volume lend itself more readily to those use cases (cold data)
> > where bit rot is a key consideration?
> >
> > If so, would a simpler bit rot strategy for gluster be
> > 1) disperse volume
> > 2) btrfs checksums + plumbing to trigger heal when scrub detects a problem
> >
> > I like simple :)
> >
> >
> We haven't explored disperse or any other cluster xlator's impact.
> The idea here is that, irrespective of the clustering mechanism, bit
> rot handling is at the brick level and independent of it. So, if the
> volume type changes in the future, bit rot protection can still exist.
Not sure what you mean by clustering mechanisms? My comment about disperse is that it is a volume type that already uses erasure coding, including healing, so by design it lends itself to the cold-data use cases that are more exposed to silent corruption.

> A btrfs bypass will be provided, but it won't be the only backend. So,
> gluster has to do its own bitrot detection and correction where
> possible.
Agree. 

> >
> > Please feel free to comment/critique.
> >
> > With regards,
> > Shishir
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
> >