[Gluster-devel] bit rot support for glusterfs design draft v0.1

Paul Cuzner pcuzner at redhat.com
Thu Jan 30 01:10:03 UTC 2014


Thanks for the info. In terms of managing the process, is this the kind of flow you're looking at? 

- a "vol set" command gets issued to enable the volume for bit-rot detection (so admins can disable for NFS only volumes, for example) 
options like "auto" or "manual" with a default of "disabled" 

- You mention that upon enabling, you'd start the scan straight away. I would suggest that the initial scan not be kicked off automatically as soon as a volume is enabled for bitd. Admins may still want the first pass to run over a weekend or overnight to avoid the impact of the read/compute cycle on live services - perhaps if they could enable the volume in advance and then schedule a cron job to initiate the scan for the volume, it would be more workable. 

- to support NFS workloads, the manual mode and cron approach can be used (a rough sketch follows below). 
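
Purely as an illustration of what I mean - the volume name and schedule are made up, and I'm reusing the "vol bitd <vol> start" command I sketch further down, which is only a proposal at this point:

    # hypothetical crontab entry on a trusted node: kick off the bit-rot scan
    # for volume "archive" at 01:00 every Saturday, outside business hours
    0 1 * * 6  gluster vol bitd archive start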

This also raises the topic of interruption and restart of the scan on large volumes. What if bitd crashes, gets cancelled, or the box goes down? 

Are you thinking of instrumenting the progress or state of bitd for a given volume? 

For example: 

vol bitd <vol> status 
  Bitd Mode            : Initialising    { Initialising, then changes to match the vol set option (auto, manual) 
  Scan Started         : 30 Jan 2014 11:10 
  Scan Complete        : pending         { would show the end time; admins could use this as an estimate for other vols in the environment 
  Run Time             : 1hr 30secs 
  Avg files/sec        : 2000 
  Total Scanned [G,T]B : 400GB 
  Estimated Completion : 2hr             { simple estimate, based on sum of bricks, total scanned and runtime - refreshed on each call 

with deferred scanning (NFS) supported by a 

vol bitd <vol> start 
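
For the estimated completion figure, the arithmetic I have in mind is just a linear projection - a quick Python sketch, names and numbers are mine and purely illustrative:

    # project remaining scan time from the average throughput observed so far
    def estimate_completion_secs(total_brick_bytes, scanned_bytes, runtime_secs):
        if scanned_bytes == 0:
            return None                      # nothing scanned yet, no estimate possible
        rate = scanned_bytes / runtime_secs  # bytes per second so far
        return (total_brick_bytes - scanned_bytes) / rate

    # e.g. 933 GB of bricks, 400 GB scanned in 90 minutes -> ~7200s, about 2 hours to go
    print(estimate_completion_secs(933e9, 400e9, 5400))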

Is this the kind of thing you're thinking of operationally? 

----- Original Message -----

> From: "shishir gowda" <gowda.shishir at gmail.com>
> To: "Paul Cuzner" <pcuzner at redhat.com>
> Cc: gluster-devel at nongnu.org
> Sent: Wednesday, 29 January, 2014 6:59:39 PM
> Subject: Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1

> On 29 January 2014 02:21, Paul Cuzner <pcuzner at redhat.com> wrote:

> > > From: "shishir gowda" <gowda.shishir at gmail.com>
> > > To: "Paul Cuzner" <pcuzner at redhat.com>
> > > Cc: gluster-devel at nongnu.org
> > > Sent: Tuesday, 28 January, 2014 6:13:13 PM
> > > Subject: Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1

> > > On 28 January 2014 03:48, Paul Cuzner <pcuzner at redhat.com> wrote:

> > > > From: "shishir gowda" <gowda.shishir at gmail.com>
> > > > To: gluster-devel at nongnu.org
> > > > Sent: Monday, 27 January, 2014 6:30:13 PM
> > > > Subject: [Gluster-devel] bit rot support for glusterfs design draft v0.1

> > > > Hi All,

> > > > Please find the updated bit-rot design for glusterfs volumes.

> > > > Thanks to Vijay Bellur for his valuable inputs in the design.

> > > > Phase 1: File level bit rot detection

> > > > The initial approach is to achieve bit rot detection at the file level,
> > > > where a checksum is computed for the complete file, and checked during
> > > > access.

> > > > A single daemon (say BitD) per node will be responsible for all the
> > > > bricks of the node. This daemon will be registered with the gluster
> > > > management daemon, and any graph changes
> > > > (add-brick/remove-brick/replace-brick/stop bit-rot) will be handled
> > > > accordingly. This BitD will register with the changelog xlator of all the
> > > > bricks on the node, and process changes from them.

> > > > Doesn't having a single daemon for all bricks, instead of a per-brick
> > > > 'bitd', introduce the potential of a performance bottleneck?

> > > Most of the current gluster-related daemons work in this mode
> > > (nfs/selfheal/quota). Additionally, if we introduce a 1:1 mapping
> > > between a brick and bitd, then managing these daemons would bring in
> > > its own overheads.

> > OK - but bitd is in the "I/O path" isn't it, and it's compute intensive -
> > which is why I'm concerned about scale and the potential impact on latency.

> Bitd would not be in the direct I/O path. Bitd's task would only be
> identifying files (from the changelog) for which a checksum needs to be
> computed. Enforcement of detection would be on the server/brick itself
> (as an xlator sitting on top of the posix layer).

> > If NFS access is going to be problematic - why not exclude it from
> > interactive checksums and resort to batch, so the admin can keep performance
> > up and schedule a scrub time where applications/users are not affected?

> Turning off for NFS could be an option, but ignoring NFS completely would not
> be a good strategy.

> > > > The changelog xlator would give the list of files (in terms of gfid)
> > > > which have changed during a defined interval. Checksums would have to
> > > > be computed for these, based on either the fd close() call for non-NFS
> > > > access, or every write for anonymous-fd access (NFS). The computed
> > > > checksum, in addition to the timestamp of the computation, would be
> > > > saved as an extended attribute (xattr) of the file. By using the
> > > > changelog xlator, we avoid periodic scans of the bricks to identify
> > > > the files whose checksums need to be updated.
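
Just to check my understanding of the compute step above - is it essentially this? (a rough Python sketch; the xattr names and the use of SHA-256 are my guesses, not from the design):

    import hashlib, os, time

    # illustrative names only - whatever BitD actually uses will differ
    CHECKSUM_XATTR = "trusted.bitrot.checksum"
    TIMESTAMP_XATTR = "trusted.bitrot.timestamp"

    def compute_and_store_checksum(path, blocksize=1024 * 1024):
        # full-file checksum, stored together with the time it was computed
        h = hashlib.sha256()
        with open(path, "rb") as f:                  # whole-file read - the phase-2 concern
            for chunk in iter(lambda: f.read(blocksize), b""):
                h.update(chunk)
        os.setxattr(path, CHECKSUM_XATTR, h.hexdigest().encode())
        os.setxattr(path, TIMESTAMP_XATTR, str(int(time.time())).encode())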

> > with the checksum update being based on the close() - what happens in
> > environments like ovirt or openstack?

> > It would be great to understand which use cases for gluster the
> > bit-rot plan addresses, and by consequence identify which use cases, if any,
> > would be problematic/impractical.

> The typical use case for bit-rot detection would be archival. If we start
> enforcing checksums on an open file, we could end up with thrashing, where
> checksums are invalid because data has been written but the checksum has not
> yet been updated. That is the reason why we would prefer to move to block-based
> checksums. In that scenario, we would update the block checksum on every write,
> making sure we handle such cases.

> > > > Using the changelog is a great idea, but I'd also see a requirement for an
> > > > admin-initiated full scan, at least when bringing existing volumes under
> > > > bitd control.

> > > Sorry, failed to mention it. Once bitrot is turned on, a full scan of
> > > each brick is started.

> > > > Also, what's the flow if the xattr is unreadable due to bit rot? In btrfs,
> > > > metadata is typically mirrored.

> > > Currently, if the xattr is unreadable, we would treat it as a failure from
> > > the brick end. If the volume is replicated, then the other brick might be
> > > able to serve the file.

> > > > Upon access (open for non-anonymous-fd calls, every read for
> > > > anonymous-fd calls) from any client, the bit rot detection xlator
> > > > loaded on top of the bricks would recompute the checksum of the file,
> > > > and allow the calls to proceed if they match, or fail them if they
> > > > mismatch. This introduces extra workload for NFS workloads, and for
> > > > large files, which require a read of the complete file to recompute the
> > > > checksum (we try to solve this in phase-2).

> > > > every read..? That sounds like such an overhead, admins would just turn
> > > > it off.

> > > NFS does not send open calls, and sends read calls directly on
> > > anonymous fd's. On such occasions, for anonymous-fd reads, we will have
> > > to compute the checksum for every read. This is one of the reasons why in
> > > phase 2 we want block-level checksums, to prevent a read of the complete
> > > file for every read.

> > > > I assume failing a read due to checksum inconsistency in a replicated
> > > > volume would trigger one of the other replicas to be used, so the issue is
> > > > transparent to the end user/application.

> > > That is the expected behaviour.

> > > > Since the data write happens first, followed by a delayed checksum
> > > > compute, there is a time frame where we might have data updated, but
> > > > checksums yet to be computed. We should allow access to such files
> > > > if the file timestamp (mtime) has changed and is within a defined
> > > > range of the current time.
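
If I'm reading the paragraphs above correctly, the access-time check would behave roughly like this (Python sketch, reusing the illustrative xattr names from my earlier snippet; the grace window value is made up):

    import hashlib, os, time

    CHECKSUM_XATTR = "trusted.bitrot.checksum"
    TIMESTAMP_XATTR = "trusted.bitrot.timestamp"
    GRACE_SECONDS = 300     # the "defined range" - value invented for illustration

    def allow_access(path):
        mtime = os.stat(path).st_mtime
        try:
            stored = os.getxattr(path, CHECKSUM_XATTR).decode()
            checksum_ts = int(os.getxattr(path, TIMESTAMP_XATTR).decode())
        except OSError:
            return True      # no checksum recorded yet - nothing to compare against
        # modified after the last checksum compute, and recently: the stored checksum
        # is merely stale, not evidence of bit rot, so let the access through
        if mtime > checksum_ts and (time.time() - mtime) < GRACE_SECONDS:
            return True
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest() == stored   # mismatch -> fail the fop (or try another replica)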

> > > > Additionally, we could/should have the ability to switch off checksum
> > > > computation from the glusterfs side, if the underlying FS
> > > > exposes/implements bit-rot detection (btrfs).

> > > > +1 Why re-invent the wheel!

> > > > Phase 2: Block-level (user-space/defined) bit rot detection and correction

> > > > The eventual aim is to be able to heal/correct bit rot in files. To
> > > > achieve this, we compute checksums at a finer-grained level, such as a
> > > > block (size limited by the bit rot algorithm), so that we not only
> > > > detect bit rot, but also have the ability to restore the data.
> > > > Additionally, for large files, checking the checksums at block level
> > > > is more efficient than recomputing the checksum of the whole
> > > > file for an access.

> > > > In this phase, we could move the checksum computation to the
> > > > xlator loaded on top of the posix translator at each brick. With
> > > > every write, we could compute the checksum, store it and
> > > > continue with the write, or vice versa. Every access would also be able
> > > > to read/compute the checksum of the requested block, check it against the
> > > > saved checksum of the block, and act accordingly. This would take away
> > > > the dependency on the external BitD and the changelog xlator.
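
So per block, the bookkeeping would be something along these lines? (Python sketch; the block size and the in-memory dict standing in for whatever store ends up holding the block checksums are both made up):

    import hashlib

    BLOCK_SIZE = 128 * 1024          # illustrative block size

    # (path, block index) -> digest; in practice this would live in the custom
    # store/DB mentioned below, not in memory
    block_checksums = {}

    def on_write(path, offset, length):
        # after a write lands, refresh the checksum of each block it touched
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        with open(path, "rb") as f:
            for idx in range(first, last + 1):
                f.seek(idx * BLOCK_SIZE)
                block = f.read(BLOCK_SIZE)
                block_checksums[(path, idx)] = hashlib.sha256(block).hexdigest()

    def check_block(path, idx):
        # on read, verify only the requested block - no whole-file read needed
        with open(path, "rb") as f:
            f.seek(idx * BLOCK_SIZE)
            block = f.read(BLOCK_SIZE)
        expected = block_checksums.get((path, idx))
        return expected is None or hashlib.sha256(block).hexdigest() == expected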

> > > > Additionally, using an Error-Correcting Code (ECC) or
> > > > Forward Error Correction (FEC) algorithm would enable us to correct a
> > > > few bits in a block which have gone corrupt. And computing the
> > > > complete file's checksum is eliminated, as we are dealing with blocks
> > > > of defined size.
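
For anyone not familiar with the ECC idea, the classic toy example is a Hamming(7,4) code, which can locate and flip a single corrupted bit in every 4 data bits - not suggesting this particular code for gluster, just illustrating the principle (Python):

    def hamming74_encode(nibble):
        # encode 4 data bits into a 7-bit codeword: data at positions 3,5,6,7,
        # parity at positions 1,2,4
        d = [(nibble >> i) & 1 for i in range(4)]
        c = [0] * 8                                  # index 0 unused
        c[3], c[5], c[6], c[7] = d
        c[1] = c[3] ^ c[5] ^ c[7]
        c[2] = c[3] ^ c[6] ^ c[7]
        c[4] = c[5] ^ c[6] ^ c[7]
        return c[1:]

    def hamming74_decode(codeword):
        # the syndrome is the position of the (single) corrupted bit, or 0 if clean
        c = [0] + list(codeword)
        syndrome = (c[1] ^ c[3] ^ c[5] ^ c[7]) \
                 + (c[2] ^ c[3] ^ c[6] ^ c[7]) * 2 \
                 + (c[4] ^ c[5] ^ c[6] ^ c[7]) * 4
        if syndrome:
            c[syndrome] ^= 1                         # flip the corrupted bit back
        return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)

    cw = hamming74_encode(0b1011)
    cw[4] ^= 1                                       # simulate a single flipped bit
    assert hamming74_decode(cw) == 0b1011            # data recovered despite the corruption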

> > > > We require the ability to store these fine-grained checksums
> > > > efficiently, and extended attributes would not scale for this
> > > > implementation. Either a custom backing store or a DB would be
> > > > preferable in this instance.

> > > > so if there is a per-'block' checksum, won't our capacity overheads
> > > > increase to store the extra metadata, on top of our existing
> > > > replication/raid overhead?

> > > That is true. But we need to address bit rot at the brick level.

> > > > Where does Xavi's disperse volume fit into this? Would an erasure-coded
> > > > volume lend itself more readily to those use cases (cold data) where bit
> > > > rot is a key consideration?

> > > > If so, would a simpler bit rot strategy for gluster be
> > > > 1) disperse volume
> > > > 2) btrfs checksums + plumbing to trigger heal when a scrub detects a problem

> > > > I like simple :)

> > > We haven't explored the impact of disperse or any other cluster xlators.
> > > The idea here is that, irrespective of the clustering mechanism, bit rot
> > > detection is at the brick level and independent. So, in the future, if the
> > > volume type changes, bit rot detection can still exist.

> > Not sure what you mean by clustering mechanisms? My comment about disperse is
> > that this is a volume that already uses erasure coding that includes
> > healing, so by design it lends itself to the cold data use cases that are
> > more open to silent corruption.

> As a general rule, each xlator/feature should be independent of the others. So
> if erasure coding includes such behaviour, we might prevent bit rot from being
> turned on for such volumes (at the management layer).

> > > A btrfs bypass will be provided, but it won't be the only backend. So,
> > > gluster has to do its own bit-rot detection and correction where
> > > possible.

> > Agree.

> > > > Please feel free to comment/critique.

> > > > With regards,
> > > > Shishir

> > > > _______________________________________________
> > > > Gluster-devel mailing list
> > > > Gluster-devel at nongnu.org
> > > > https://lists.nongnu.org/mailman/listinfo/gluster-devel