[Gluster-users] simple AFR setup, one server crashes, entire cluster becomes unusable ?

Mon Dec 8 10:26:03 UTC 2008

Hello all,

I have been running a four-node (two servers, two clients) server-side 
AFR cluster for some time now. Its architecture is described fairly 
accurately by the following Wiki page:
http://www.gluster.org/docs/index.php/High-availability_storage_using_server-side_AFR

In summary, there are two servers and two clients; the clients are set 
up to connect to a single hostname, which is a round-robin DNS entry for 
both of the servers.
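
For illustration only, here is a minimal Python sketch of the behaviour 
one might expect from a round-robin entry: resolve the name to every 
address behind it, then use the first address that accepts a connection. 
This is not how the GlusterFS client itself resolves names or fails over. 
The hostname, the port number, and both function names are hypothetical.

```python
import socket


def resolve_all(hostname, port, resolver=socket.getaddrinfo):
    """Return every distinct IP the hostname resolves to, in resolver order."""
    seen = []
    for _family, _socktype, _proto, _canonname, sockaddr in resolver(
        hostname, port, type=socket.SOCK_STREAM
    ):
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def first_reachable(addresses, port, timeout=2.0,
                    connect=socket.create_connection):
    """Try each address in turn; return the first that accepts TCP, else None."""
    for ip in addresses:
        try:
            conn = connect((ip, port), timeout)
        except OSError:
            # This server is down or unreachable; fall through to the next.
            continue
        conn.close()
        return ip
    return None


if __name__ == "__main__":
    # Hypothetical round-robin name and port; substitute your own values.
    ips = resolve_all("gluster.example.com", 6996)
    print("addresses:", ips)
    print("first reachable:", first_reachable(ips, 6996))
```

A client that resolves the name once and then sticks to a single address 
gets none of this fallback, which is one reason round-robin DNS alone 
may not provide failover.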

Last night, glusterfsd on one of the servers crashed (with a core dump), 
and instead of the remaining server being used automatically, the entire 
cluster became unusable. The logs on both the remaining functional 
server and the clients are littered with tens of thousands of error 
messages, and the mounted shares were not accessible.

It is (was?) my understanding that Gluster tolerates faults in which 
one of the nodes becomes inaccessible. Is this the case?

Particulars...

Both servers:
[root@server glusterfs]# uname -s -r -o -i
Linux 2.6.25.10-86.fc9.i686 i386 GNU/Linux
[root@server glusterfs]# cat /etc/redhat-release
Fedora release 9 (Sulphur)
GLUSTER CONFIG: http://glusterfs.pastebin.com/m45feb982

Both clients:
[root@client glusterfs]# uname -s -r -o -i
Linux 2.6.24.4 x86_64 GNU/Linux
[root@client glusterfs]# cat /etc/redhat-release
Fedora release 8 (Werewolf)
GLUSTER CONFIG: http://glusterfs.pastebin.com/m48b7dd28

LOGS FROM THE INCIDENT: http://glusterfs.pastebin.com/m72cbc8f5
(excerpts from all four machines)

(Note the following backtrace from the server that crashed:)
[0x110400]
/usr/lib/libglusterfs.so.0(dict_del+0x2d)[0x808e7d]
/usr/lib/glusterfs/1.3.12/xlator/protocol/client.so(notify+0x21b)[0x126a4b]
/usr/lib/libglusterfs.so.0(transport_notify+0x3d)[0x81374d]
/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xf9)[0x814779]
/usr/lib/libglusterfs.so.0(poll_iteration+0xa0)[0x8138f0]
[glusterfs](main+0x786)[0x804a156]
/lib/libc.so.6(__libc_start_main+0xe6)[0xb655d6]
[glusterfs][0x8049431]
---------

What could have caused Gluster to crash? Should the cluster have 
continued to function? What, if anything, can be done to prevent this 
from happening in the future?

Thank you, all.

-- 
Daniel Maher <dma+gluster AT witbe DOT net>