[Gluster-devel] glusterfsd crash with bizarre (dangerous?) results...
Daniel Maher
dma+gluster at witbe.net
Fri Apr 4 10:18:07 UTC 2008
Hi all,
While running a series of FFSB tests against my newly-created Gluster
cluster, I caused glusterfsd to crash on one of the two storage nodes.
The relevant lines from the log file are pastebin'd :
http://pastebin.ca/970831
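(If a backtrace would be more useful than the log, I can try to pull
one along these lines - this assumes core dumps were enabled on the
node, that a core was actually written somewhere findable, and that the
binary lives at /usr/sbin/glusterfsd, none of which I've verified yet :)
[dfsD]# ulimit -c unlimited
(then re-run the test to get a fresh core)
[dfsD]# gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "bt full" > /tmp/glusterfsd-bt.txt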
Even more troubling is that when I restarted glusterfsd, the node
did /not/ self-heal :
The mountpoint on the client :
[dfsA]# du -s /opt/gfs-mount/
2685304 /opt/gfs-mount/
The DS on the node which did not fail :
[dfsC]# du -s /opt/gfs-ds/
2685328 /opt/gfs-ds/
The DS on the node which failed, ~ 5 minutes after restarting
glusterfsd :
[dfsD]# du -s /opt/gfs-ds/
27092 /opt/gfs-ds/
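(As I understand it, AFR heals a file lazily when it is next looked up
or opened, rather than in the background - so if that's right, a full
walk of the mountpoint from the client ought to touch every file and
force the repair. This is an assumption on my part, so correct me if
that's not the intended recovery procedure :)
[dfsA]# find /opt/gfs-mount -type f -exec head -c1 '{}' \; > /dev/null
[dfsD]# du -s /opt/gfs-ds/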
Even MORE troubling : when I restarted glusterfsd on the node which did
not fail, to see if that would help, the results got even more bizarre :
The mountpoint on the client :
[dfsA]# du -s /opt/gfs-mount/
17520 /opt/gfs-mount/
The DS on the node which did not fail :
[dfsC]# du -s /opt/gfs-ds/
2685328 /opt/gfs-ds/
The DS on the node which failed :
[dfsD]# du -s /opt/gfs-ds/
27092 /opt/gfs-ds/
A simple visual inspection shows that the files and directories are
clearly different between the client and between the two nodes (a
checksummed comparison is sketched after the listings). For example :
(Client)
[dfsA]# ls fillfile*
fillfile0 fillfile11 fillfile14 fillfile2 fillfile5 fillfile8
fillfile1 fillfile12 fillfile15 fillfile3 fillfile6 fillfile9
fillfile10 fillfile13 fillfile16 fillfile4 fillfile7
[dfsA]# ls -l fillfile?
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile3
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile4
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile5
-rwx------ 1 root root 0 2008-04-04 09:42 fillfile6
-rwx------ 1 root root 0 2008-04-04 09:42 fillfile7
-rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
-rwx------ 1 root root 0 2008-04-04 09:42 fillfile9
(Node that didn't fail)
[dfsC]# ls fillfile*
fillfile0 fillfile13 fillfile18 fillfile22 fillfile4 fillfile9
fillfile1 fillfile14 fillfile19 fillfile23 fillfile5
fillfile10 fillfile15 fillfile2 fillfile24 fillfile6
fillfile11 fillfile16 fillfile20 fillfile25 fillfile7
fillfile12 fillfile17 fillfile21 fillfile3 fillfile8
[dfsC]# ls -l fillfile?
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile1
-rwx------ 1 root root 131072 2008-04-04 09:42 fillfile2
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile3
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile4
-rwx------ 1 root root 65536 2008-04-04 09:42 fillfile5
-rwx------ 1 root root 0 2008-04-04 09:42 fillfile6
-rwx------ 1 root root 0 2008-04-04 09:42 fillfile7
-rwx------ 1 root root 196608 2008-04-04 09:42 fillfile8
-rwx------ 1 root root 0 2008-04-04 09:42 fillfile9
(Node that failed)
[dfsD]# ls fillfile*
fillfile0 fillfile11 fillfile14 fillfile2 fillfile5 fillfile8
fillfile1 fillfile12 fillfile15 fillfile3 fillfile6 fillfile9
fillfile10 fillfile13 fillfile16 fillfile4 fillfile7
[dfsD]# ls -l fillfile?
-rwx------ 1 root root 65536 2008-04-04 09:08 fillfile0
-rwx------ 1 root root 131072 2008-04-04 09:08 fillfile1
-rwx------ 1 root root 4160139 2008-04-04 09:08 fillfile2
-rwx------ 1 root root 327680 2008-04-04 09:08 fillfile3
-rwx------ 1 root root 262144 2008-04-04 09:08 fillfile4
-rwx------ 1 root root 65536 2008-04-04 09:08 fillfile5
-rwx------ 1 root root 1196446 2008-04-04 09:08 fillfile6
-rwx------ 1 root root 131072 2008-04-04 09:08 fillfile7
-rwx------ 1 root root 3634506 2008-04-04 09:08 fillfile8
-rwx------ 1 root root 131072 2008-04-04 09:08 fillfile9
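(To compare the three views more rigorously than by eyeball, a
checksummed listing from each box can be diffed - the paths are mine,
and /tmp/ds.md5sums is just an example output name :)
[dfsC]# ( cd /opt/gfs-ds && find . -type f | sort | xargs md5sum ) > /tmp/ds.md5sums
[dfsD]# ( cd /opt/gfs-ds && find . -type f | sort | xargs md5sum ) > /tmp/ds.md5sums
[dfsA]# ( cd /opt/gfs-mount && find . -type f | sort | xargs md5sum ) > /tmp/ds.md5sums
Diffing those three lists shows exactly which files diverge and how,
rather than relying on du totals.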
What the heck is going on here ? Three wildly different results -
that's really not a good thing. These results also seem to be
"permanent" - after waiting a good 10 minutes (and executing the same
du commands a few more times), the numbers are unchanged...
Finally, I edited "fillfile6" (0 bytes on dfsA and dfsC, 1196446
bytes on dfsD) via the mountpoint on dfsA, and the changes were
immediately reflected on the storage nodes. Clearly the AFR translator
is operational /now/, but the enormous discrepancy is not a good thing,
to say the least.
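(If it helps with debugging, the metadata that AFR keeps on the
backend copies can presumably be dumped with getfattr - I'm assuming
the translator records its versioning info in trusted.* extended
attributes; I don't know the exact attribute names off-hand :)
[dfsC]# getfattr -d -m . -e hex /opt/gfs-ds/fillfile6
[dfsD]# getfattr -d -m . -e hex /opt/gfs-ds/fillfile6
If the two copies report different values there, that would at least
confirm that AFR knows they have diverged.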
--
Daniel Maher <dma AT witbe.net>