[Gluster-devel] glusterfsd crash with bizarre (dangerous?) results...

Daniel Maher dma+gluster at witbe.net
Fri Apr 4 10:18:07 UTC 2008


Hi all,

While running a series of FFSB tests against my newly created Gluster
cluster, I caused glusterfsd to crash on one of the two storage nodes.
The relevant lines from the log file are pastebin'd :
http://pastebin.ca/970831
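
If it would help with debugging, a backtrace from the crashed
glusterfsd might be more useful than the log alone.  A rough sketch of
how I could capture one the next time it dies (the binary path and the
spec file location are guesses for this setup, and the core file name
may vary) :

[dfsD]# ulimit -c unlimited
[dfsD]# glusterfsd -f /etc/glusterfs/glusterfs-server.vol
( ... wait for the next crash ... )
[dfsD]# gdb -batch -ex "thread apply all bt" /usr/sbin/glusterfsd core > glusterfsd-bt.txt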


Even more troubling is that when I restarted glusterfsd, the node
did /not/ self-heal :

The mountpoint on the client :
[dfsA]# du -s /opt/gfs-mount/
2685304 /opt/gfs-mount/

The DS on the node which did not fail :
[dfsC]# du -s /opt/gfs-ds/
2685328 /opt/gfs-ds/

The DS on the node which failed, ~ 5 minutes after restarting
glusterfsd :
[dfsD]# du -s /opt/gfs-ds/
27092   /opt/gfs-ds/
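
As far as I understand it, AFR keeps its self-heal bookkeeping in
extended attributes (under a trusted.afr prefix) on the backend files,
so that might be worth inspecting on both nodes.  The exact attribute
names may well differ between versions - this is only an educated
guess - but something like this, on one of the test files :

[dfsC]# getfattr -d -m trusted.afr -e hex /opt/gfs-ds/fillfile0
[dfsD]# getfattr -d -m trusted.afr -e hex /opt/gfs-ds/fillfile0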


Even MORE troubling : when I restarted glusterfsd on the node which
did not fail (to see if that would help), it produced even more
bizarre results :

The mountpoint on the client :
[dfsA]# du -s /opt/gfs-mount/
17520   /opt/gfs-mount/

The DS on the node which did not fail :
[dfsC]# du -s /opt/gfs-ds/
2685328 /opt/gfs-ds/

The DS on the node which failed :
[dfsD]# du -s /opt/gfs-ds/
27092   /opt/gfs-ds/
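
One quick way to see which copy the client is actually serving at this
point is to checksum a single test file via the mountpoint and compare
it against each backend directly (file name taken from the listings
below) :

[dfsA]# md5sum /opt/gfs-mount/fillfile2
[dfsC]# md5sum /opt/gfs-ds/fillfile2
[dfsD]# md5sum /opt/gfs-ds/fillfile2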


A simple visual inspection shows that the files and directories are
clearly different between the client and the two storage nodes.  For
example :

(Client)
[dfsA]# ls fillfile*
fillfile0   fillfile11  fillfile14  fillfile2  fillfile5  fillfile8
fillfile1   fillfile12  fillfile15  fillfile3  fillfile6  fillfile9
fillfile10  fillfile13  fillfile16  fillfile4  fillfile7
[dfsA]# ls -l fillfile?
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile3
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile4
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile5
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile6
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile7
-rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile9

(Node that didn't fail)
[dfsC]# ls fillfile*
fillfile0   fillfile13  fillfile18  fillfile22  fillfile4  fillfile9
fillfile1   fillfile14  fillfile19  fillfile23  fillfile5
fillfile10  fillfile15  fillfile2   fillfile24  fillfile6
fillfile11  fillfile16  fillfile20  fillfile25  fillfile7
fillfile12  fillfile17  fillfile21  fillfile3   fillfile8
[dfsC]# ls -l fillfile?
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile3
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile4
-rwx------ 1 root root  65536 2008-04-04 09:42 fillfile5
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile6
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile7
-rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
-rwx------ 1 root root      0 2008-04-04 09:42 fillfile9

(Node that failed)
[dfsD]# ls fillfile*
fillfile0   fillfile11  fillfile14  fillfile2  fillfile5  fillfile8
fillfile1   fillfile12  fillfile15  fillfile3  fillfile6  fillfile9
fillfile10  fillfile13  fillfile16  fillfile4  fillfile7
[dfsD]# ls -l fillfile?
-rwx------ 1 root root   65536 2008-04-04 09:08 fillfile0
-rwx------ 1 root root  131072 2008-04-04 09:08 fillfile1
-rwx------ 1 root root 4160139 2008-04-04 09:08 fillfile2
-rwx------ 1 root root  327680 2008-04-04 09:08 fillfile3
-rwx------ 1 root root  262144 2008-04-04 09:08 fillfile4
-rwx------ 1 root root   65536 2008-04-04 09:08 fillfile5
-rwx------ 1 root root 1196446 2008-04-04 09:08 fillfile6
-rwx------ 1 root root  131072 2008-04-04 09:08 fillfile7
-rwx------ 1 root root 3634506 2008-04-04 09:08 fillfile8
-rwx------ 1 root root  131072 2008-04-04 09:08 fillfile9
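
Rather than eyeballing the listings, the two backends can also be
compared wholesale from the client - a rough sketch, assuming
passwordless ssh to both storage nodes and no whitespace in the file
names :

[dfsA]# ssh dfsC 'cd /opt/gfs-ds && find . -type f | sort | xargs md5sum' > dfsC.sums
[dfsA]# ssh dfsD 'cd /opt/gfs-ds && find . -type f | sort | xargs md5sum' > dfsD.sums
[dfsA]# diff dfsC.sums dfsD.sums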


What the heck is going on here ?  Three wildly different results -
that's really not a good thing.  These results seem "permanent" as well
- after waiting a good 10 minutes (and executing the same du command a
few more times), the results are the same...


Finally, I edited "fillfile6" (0 bytes on dfsA and dfsC, 1196446
bytes on dfsD) via the mountpoint on dfsA, and the changes were
immediately reflected on the storage nodes.  Clearly the AFR translator
is operational /now/, but the enormous discrepancy is not a good thing,
to say the least.
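
Since writing to a file through the mountpoint does propagate, I
assume that (in this version at least) self-heal is only triggered
when a file is actually looked up or opened from the client side.  If
that is the case, walking the entire tree from the client ought to
force the rest of the data back into sync - something along these
lines (stat everything first, then open/read each file) :

[dfsA]# ls -lR /opt/gfs-mount > /dev/null
[dfsA]# find /opt/gfs-mount -type f -exec head -c 1 {} \; > /dev/null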



-- 
Daniel Maher <dma AT witbe.net>




