[Gluster-devel] glusterfsd crash with bizarre (dangerous?) results...

Anand Avati avati at zresearch.com
Fri Apr 4 10:37:15 UTC 2008


Daniel,
 What is the tla revision of your software? (as reported by glusterfs --version)

avati

2008/4/4, Daniel Maher <dma+gluster at witbe.net>:
>
>
> Hi all,
>
> While running a series of FFSB tests against my newly-created Gluster
> cluster, I caused glusterfsd to crash on one of the two storage nodes.
> The relevant lines from the log file are pastebin'd here:
> http://pastebin.ca/970831
>
>
> Even more troubling is that when I restarted glusterfsd, the node
> did /not/ self-heal:
>
> The mountpoint on the client:
> [dfsA]# du -s /opt/gfs-mount/
> 2685304 /opt/gfs-mount/
>
> The DS on the node which did not fail:
> [dfsC]# du -s /opt/gfs-ds/
> 2685328 /opt/gfs-ds/
>
> The DS on the node which failed, ~5 minutes after restarting
> glusterfsd:
> [dfsD]# du -s /opt/gfs-ds/
> 27092   /opt/gfs-ds/
>
>
> Even MORE troubling, I restarted glusterfsd on the node which did not
> fail, to see if that would help - and it produced even more bizarre
> results:
>
> The mountpoint on the client:
> [dfsA]# du -s /opt/gfs-mount/
> 17520   /opt/gfs-mount/
>
> The DS on the node which did not fail:
> [dfsC]# du -s /opt/gfs-ds/
> 2685328 /opt/gfs-ds/
>
> The DS on the node which failed:
> [dfsD]# du -s /opt/gfs-ds/
> 27092   /opt/gfs-ds/
>
>
> A simple visual inspection shows that the files and directories are
> clearly different between the client and the two storage nodes.  For
> example:
>
> (Client)
> [dfsA]# ls fillfile*
> fillfile0   fillfile11  fillfile14  fillfile2  fillfile5  fillfile8
> fillfile1   fillfile12  fillfile15  fillfile3  fillfile6  fillfile9
> fillfile10  fillfile13  fillfile16  fillfile4  fillfile7
> [dfsA]# ls -l fillfile?
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile0
> -rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
> -rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile3
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile4
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile5
> -rwx------ 1 root root      0 2008-04-04 09:42 fillfile6
> -rwx------ 1 root root      0 2008-04-04 09:42 fillfile7
> -rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
> -rwx------ 1 root root      0 2008-04-04 09:42 fillfile9
>
> (Node that didn't fail)
> [dfsC]# ls fillfile*
> fillfile0   fillfile13  fillfile18  fillfile22  fillfile4  fillfile9
> fillfile1   fillfile14  fillfile19  fillfile23  fillfile5
> fillfile10  fillfile15  fillfile2   fillfile24  fillfile6
> fillfile11  fillfile16  fillfile20  fillfile25  fillfile7
> fillfile12  fillfile17  fillfile21  fillfile3   fillfile8
> [dfsC]# ls -l fillfile?
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile0
> -rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
> -rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile3
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile4
> -rwx------ 1 root root  65536 2008-04-04 09:42 fillfile5
> -rwx------ 1 root root      0 2008-04-04 09:42 fillfile6
> -rwx------ 1 root root      0 2008-04-04 09:42 fillfile7
> -rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
> -rwx------ 1 root root      0 2008-04-04 09:42 fillfile9
>
> (Node that failed)
> [dfsD]# ls fillfile*
> fillfile0   fillfile11  fillfile14  fillfile2  fillfile5  fillfile8
> fillfile1   fillfile12  fillfile15  fillfile3  fillfile6  fillfile9
> fillfile10  fillfile13  fillfile16  fillfile4  fillfile7
> [dfsD]# ls -l fillfile?
> -rwx------ 1 root root   65536 2008-04-04 09:08 fillfile0
> -rwx------ 1 root root  131072 2008-04-04 09:08 fillfile1
> -rwx------ 1 root root 4160139 2008-04-04 09:08 fillfile2
> -rwx------ 1 root root  327680 2008-04-04 09:08 fillfile3
> -rwx------ 1 root root  262144 2008-04-04 09:08 fillfile4
> -rwx------ 1 root root   65536 2008-04-04 09:08 fillfile5
> -rwx------ 1 root root 1196446 2008-04-04 09:08 fillfile6
> -rwx------ 1 root root  131072 2008-04-04 09:08 fillfile7
> -rwx------ 1 root root 3634506 2008-04-04 09:08 fillfile8
> -rwx------ 1 root root  131072 2008-04-04 09:08 fillfile9
>
>
> What the heck is going on here?  Three wildly different results -
> that's really not a good thing.  These results seem "permanent" as
> well - after waiting a good 10 minutes (and executing the same du
> command a few more times), the results are the same...
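
[The du and ls comparisons above can be made systematic by diffing
per-node checksum manifests - du only sums apparent sizes, so identical
totals would not even guarantee identical contents. A minimal sketch;
the /tmp directories and sample files below are stand-ins for the real
/opt/gfs-ds backends on dfsC and dfsD:]

```shell
# Build one "checksum  path" manifest per node, sorted by path, then diff.
# /tmp/gfs-ds-C and /tmp/gfs-ds-D stand in for /opt/gfs-ds on dfsC / dfsD.
mkdir -p /tmp/gfs-ds-C /tmp/gfs-ds-D
printf 'same contents' > /tmp/gfs-ds-C/fillfile0
printf 'same contents' > /tmp/gfs-ds-D/fillfile0
printf 'only on the failed node' > /tmp/gfs-ds-D/fillfile6

( cd /tmp/gfs-ds-C && find . -type f -exec md5sum {} + | sort -k2 ) > /tmp/manifest-C
( cd /tmp/gfs-ds-D && find . -type f -exec md5sum {} + | sort -k2 ) > /tmp/manifest-D

# Any diff output is a file that differs, or exists on only one node:
diff /tmp/manifest-C /tmp/manifest-D || echo "replicas differ"
```

[In practice each manifest would be generated locally on its storage
node and the two files compared on one host.]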
>
>
> Finally, I edited "fillfile6" (0 bytes on dfsA and dfsC, 1196446
> bytes on dfsD) via the mountpoint on dfsA, and the changes were
> immediately reflected on the storage nodes.  Clearly the AFR translator
> is operational /now/, but the enormous discrepancy is not a good thing,
> to say the least.
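
[The fillfile6 observation fits how AFR of this era healed files: lazily,
when a file was looked up or opened through the client mountpoint. The
usual way to force a full heal was a recursive walk that stats and reads
every file. A sketch, with /tmp/gfs-mount standing in for the real
/opt/gfs-mount client mountpoint on dfsA:]

```shell
# Walk the whole tree through the client mount; each lookup/open gives
# the AFR translator a chance to compare replicas and repair stale ones.
MNT=/tmp/gfs-mount                 # stand-in for /opt/gfs-mount on dfsA
mkdir -p "$MNT"
printf 'data' > "$MNT/fillfile0"   # sample file so the walk has work to do

# Stat every entry (triggers a lookup on each replica) ...
find "$MNT" -exec stat {} + > /dev/null
# ... and read the first byte of every regular file (triggers open/read):
find "$MNT" -type f -exec head -c1 {} \; > /dev/null
```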
>
>
>
>
> --
> Daniel Maher <dma AT witbe.net>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>



-- 
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.


