[Gluster-users] 1.4.0RC6 AFR problems

Wed Dec 24 23:21:45 UTC 2008

At 02:45 PM 12/24/2008, Stas Oskin wrote:
>Hi Keith.
>
>Sorry for the previous email, it was a bit out of place.
>
>Would you mind sharing how you recovered from this issue?

I think Amar's responses to my email will be helpful.
Since some of my issues were bugs that are fixed or being fixed,
my particular method of recovery wouldn't necessarily apply.

>I'm going to stress test a solution based on GlusterFS next week, 
>including pulling a live disk offline in the middle of work, and would 
>appreciate any hints you might share regarding recovering from the failures.

I think that with the bugs fixed, it should be as easy as I expected.
Once you have an empty underlying filesystem (with no gluster 
extended attributes), AFR should auto-heal the entire directory 
without a problem.
In my case it tried to do this but hit a bug, which I worked around by 
setting the option favorite-child in the AFR translator.
This isn't necessarily an ideal production run-time configuration, 
but it's reasonable to set it to recover from a drive failure and 
then unset it after the recovery is complete.

As for the specifics of forcing auto-heal, I used the find command
from the wiki:
find /GLUSTERMOUNTPOINT -type f -exec head -1 {} \; > /dev/null

It can be interesting to tail -f the gluster logfile in another 
window while this runs.
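
A rough sketch of the same read-every-file idea as a shell function that also reports which files could not be read. The function name read_all_files and the ok/FAILED output format are made up for illustration and are not part of gluster or the wiki. It assumes file names contain no newlines.

```shell
#!/bin/sh
# read_all_files MOUNTPOINT (hypothetical helper, not a gluster tool):
# read the first line of every regular file under MOUNTPOINT, which is
# what the find/head one-liner does, and print "ok PATH" or "FAILED PATH"
# depending on whether the read succeeded.
read_all_files() {
    find "$1" -type f -print | while IFS= read -r f; do
        if head -n 1 "$f" > /dev/null 2>&1; then
            echo "ok $f"
        else
            echo "FAILED $f"
        fi
    done
}
```

Something like read_all_files /GLUSTERMOUNTPOINT | grep '^FAILED' would then list only the files whose read returned an error.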

I've found the script "whodir", posted a while ago, to be helpful 
when I'm having trouble re-mounting the filesystem after gluster crashes.

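--whodir--
#!/bin/sh
# Usage: whodir DIR
# Prints the PID and command line of each process whose cwd or exe
# link points at a path starting with DIR (e.g. a stuck mount point).
# Note: cut -f8 assumes the link path is the 8th field of ls -l output.
DIR=$1
find /proc 2>/dev/null | grep -E 'cwd|exe' | xargs ls -l 2>/dev/null \
| grep "> $DIR" | sed 's/  */ /g' | cut -f8 -d' ' | cut -f3 -d/ \
| sort | uniq | while read line; do echo $line $(cat /proc/$line/cmdline); done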
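
Since the column position in ls -l output varies between systems, here is a hedged variant of the same idea that resolves the /proc links with readlink instead of parsing ls. It is only a sketch: the function name whodir_readlink is made up, it assumes a Linux /proc and a readlink command (not required by POSIX), and it matches DIR itself or paths beneath it rather than any path that merely starts with the same characters.

```shell
#!/bin/sh
# whodir_readlink DIR (hypothetical name): print "PID cmdline" for each
# process whose cwd or exe link resolves to DIR or to a path under DIR.
whodir_readlink() {
    dir=${1%/}    # drop one trailing slash so /mnt/x/ and /mnt/x behave alike
    for link in /proc/[0-9]*/cwd /proc/[0-9]*/exe; do
        # Links of other users' processes may be unreadable; skip those.
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            "$dir"|"$dir"/*)
                pid=${link#/proc/}
                pid=${pid%%/*}
                echo "$pid"
                ;;
        esac
    done | sort -un | while read -r pid; do
        # cmdline is NUL-separated; turn the NULs into spaces for display.
        printf '%s %s\n' "$pid" "$(cat "/proc/$pid/cmdline" 2>/dev/null | tr '\0' ' ')"
    done
}
```

Calling whodir_readlink /GLUSTERMOUNTPOINT should list the processes that are keeping the mount point busy, much like the script above.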

>Regards.
>
>2008/12/23 Keith Freedman <freedman at freeformit.com>
>
>so, I had a drive failure on one of my boxes and it led to the discovery
>of numerous issues today:
>
>1) when a drive is failing and one of the AFR servers is dealing with
>IO errors, the other one freaks out and sometimes crashes, but
>never seems to hit a network timeout.
>
>2) when starting gluster on the server with the new empty drive, it
>gave me a bunch of errors about things being out of sync and to
>delete a file from all but the preferred server.
>This struck me as odd, since the thing was empty.
>So I used the favorite-child option, but this isn't a preferred long-term solution.
>
>3) one of the directories had 20GB of data in it. I went to do an
>ls of the directory and had to wait while it auto-healed all the
>files. While this is helpful, it would be nice to have gotten back
>the directory listing without having to wait for 20GB of data to get
>sent over the network.
>
>4) while the other server was down, the up server kept failing
>(signal 11?) and I had to constantly remount the filesystem.  It was
>giving me messages about the other node being down, which was fine, but
>then it'd just die after a while, consistently.
>
>
>_______________________________________________
>Gluster-users mailing list
>Gluster-users at gluster.org
>http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users