[Gluster-devel] Selfheal is not working? Once more

Wed Jul 30 21:42:07 UTC 2008

Previous quoted posts removed for brevity...

Martin Fick wrote:
> It does seem like it would be fairly easy to add another 
> metadata attribute to each file/directory that would hold
> a checksum for it.  This way, AFR itself could be 
> configured to check/compute the checksum anytime the file 
> is read/written.  Since this would slow AFR down, I would
> suggest a configuration option to turn this on.  If the
> checksum is wrong, it could heal to the version of the
> other brick if the other brick's checksum is correct.
> 
> Another alternative would be to create an offline 
> checksummer that updates such an attribute if it does not
> exist, and checks the checksum if it does exist.  If the
> check fails, it would simply delete the
> file and its attributes (and potentially the directory
> attributes up the tree) so that AFR will then heal it.
> 
> The only modification needed by AFR to support this
> would be to delete the checksum attribute anytime the 
> file/directory is updated so that the offline checksummer
> will recreate it instead of thinking it is corrupt.  
> In fact, even this could be eliminated so that the 
> offline checksummer is completely "self-powered":
> anytime it calculates a checksum, it could copy the 
> glusterfs version and timestamp attributes to two new
> "checksummer" attributes.  If these become out of date, the 
> checksummer will know to recompute the checksum instead of
> assuming that the file has been corrupted.
> 
> The one risk with this is that if a file gets corrupted
> on both nodes, it will get deleted on both nodes so you 
> will not have a corrupted file to at least look at.  
> This too could be overcome by saving any deleted files 
> in a separate "trash can" and cleaning the trash can 
> once the files in it have been healed, sort of a
> self-cleaning lost+found directory.
> 
> 
> I know this may not be the answers that you were 
> looking for, but I hope it helps clarify things 
> a little.
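
The "self-powered" offline checksummer described above could be sketched
roughly as follows. This is only an illustration, not anything that exists
in GlusterFS: the attribute names ("version", "checksum",
"checksummer_stamp"), the in-memory dict standing in for extended
attributes, and the trash directory layout are all assumptions made for
the example.

```python
import hashlib
import os
import shutil


def file_checksum(path):
    """Return the SHA-256 hex digest of the file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def check_file(path, attrs, trash_dir):
    """Check one file; return 'recorded', 'ok', or 'corrupt'.

    attrs is a dict {path: {name: value}} standing in for extended
    attributes (hypothetical names; a real tool would use xattrs).
    """
    meta = attrs.setdefault(path, {})
    # Assumption: "version" stands in for the glusterfs version attribute;
    # the file's mtime stands in for the timestamp attribute.
    stamp = (meta.get("version", 0), os.stat(path).st_mtime_ns)
    digest = file_checksum(path)

    # No checksum yet, or the file was legitimately updated since the
    # checksum was taken: record a fresh checksum and remember the stamp.
    if "checksum" not in meta or meta.get("checksummer_stamp") != stamp:
        meta["checksum"] = digest
        meta["checksummer_stamp"] = stamp
        return "recorded"

    if meta["checksum"] == digest:
        return "ok"

    # Stamp unchanged but contents differ: treat as corruption. Move the
    # file to the trash can and drop its attributes so that a replica
    # could heal it later.
    os.makedirs(trash_dir, exist_ok=True)
    shutil.move(path, os.path.join(trash_dir, os.path.basename(path)))
    del attrs[path]
    return "corrupt"
```

Note that this sketch only catches corruption that leaves the version and
timestamp untouched (for example, bad blocks under the file). Anything
that also changes the timestamp would be mistaken for a normal update.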

A while back I seem to remember someone talking about eventually 
creating a fsck.glusterfs utility.  Since underlying server node 
corruption would (hopefully) not be a common problem, it seems like a 
specific tool that could be run when prudent would be a good approach. 
If the underlying data is suspected of corruption on a node, run the 
normal fsck on that node, then run the fsck.glusterfs utility on the share, 
which could use a much more comprehensive set of checks and repairs 
than would be feasible in normal AFR file processing.
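
As a rough idea of one check such a tool might perform, the sketch below
compares each replica's data against a recorded checksum and proposes a
heal from a good copy to a bad one, along the lines Martin describes. No
fsck.glusterfs exists to model this on, so the function name, the input
format, and the returned plan are all hypothetical.

```python
import hashlib


def plan_heal(replicas):
    """Decide how to heal one file given its copies on each brick.

    replicas maps a brick name to (data_bytes, recorded_checksum_hex).
    Returns a list of (source_brick, target_brick) heal actions, or
    None if no copy matches its recorded checksum.
    """
    good = []
    bad = []
    for brick, (data, recorded) in replicas.items():
        if hashlib.sha256(data).hexdigest() == recorded:
            good.append(brick)
        else:
            bad.append(brick)

    if not good:
        # Every copy is suspect; leave the file for manual inspection.
        return None

    # Heal each bad copy from the first good one.
    return [(good[0], target) for target in bad]
```

For example, with a good copy on "brick1" and a corrupted copy on
"brick2", the plan is a single heal from brick1 to brick2. If both copies
fail their checksums, the function returns None instead of guessing.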

-- 

-Kevan Benson
-A-1 Networks