[Gluster-devel] Selfheal is not working? Once more
kbenson at a-1networks.com
Wed Jul 30 21:42:07 UTC 2008
Previous quoted posts removed for brevity...
Martin Fick wrote:
> It does seem like it would be fairly easy to add another
> metadata attribute to each file/directory that would hold
> a checksum for it. This way, AFR itself could be
> configured to check/compute the checksum anytime the file
> is read/written. Since this would slow AFR down, I would
> suggest a configuration option to turn this on. If the
> checksum is wrong, it could heal to the version of the
> other brick if the other brick's checksum is correct.
> Another alternative would be to create an offline
> checksummer that updates such an attribute if it does not
> exist, and checks the checksum if it does exist. If when
> it checks the checksum it fails, it would simply delete the
> file and its attributes (and potentially the directory
> attributes up the tree) so that AFR will then heal it.
> The only modification needed by AFR to support this
> would be to delete the checksum attribute anytime the
> file/directory is updated so that the offline checksummer
> will recreate it instead of thinking it is corrupt.
> In fact, even this could be eliminated so that the
> offline checksummer is completely "self-powered",
> anytime it calculates a checksum it could copy the
> glusterfs version and timestamp attributes to two new
> "checksummer" attributes. If these become out of date the
> checksummer will know to recompute the checksum instead of
> assuming that the file has been corrupted.
> The one risk with this is that if a file gets corrupted
> on both nodes, it will get deleted on both nodes so you
> will not have a corrupted file to at least look at.
> This too could be overcome by saving any deleted files
> in a separate "trash can" and cleaning the trash can
> once the files in it have been healed, sort of a self-cleaning
> lost+found directory.
> I know this may not be the answer that you were
> looking for, but I hope it helps clarify things
> a little.
A while back I seem to remember someone talking about eventually
creating a fsck.glusterfs utility. Since underlying server node
corruption would (hopefully) not be a common problem, it seems like a
specific tool that could be run when prudent would be a good approach.
If the underlying data is suspected of corruption on a node, run the
normal fsck on that node, then run the fsck.glusterfs utility on the
share, which could apply a much more comprehensive set of checks and
repairs than would be feasible during normal AFR file processing.