[Gluster-users] simple AFR setup, one server crashes, entire cluster becomes unusable ?
Daniel Maher
dma+gluster at witbe.net
Mon Dec 8 10:26:03 UTC 2008
Hello all,
I have been running a four-node (two servers, two clients) server-side
AFR cluster for some time now. Its architecture is described fairly
accurately by the following wiki page:
http://www.gluster.org/docs/index.php/High-availability_storage_using_server-side_AFR
In summary, there are two servers and two clients; the clients are set
up to connect to a single hostname, which is a round-robin DNS entry for
both of the servers.
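For anyone wanting to check a similar setup, here is a minimal sketch of
how one might list every address behind the round-robin name and test
whether each server still accepts TCP connections. The hostname
gluster.example.com is hypothetical, and port 6996 is an assumption
(substitute whatever port your server volfile listens on). It only tests
reachability; it says nothing about how the GlusterFS client itself
picks or retries a server.

```python
import socket

def resolve_all(hostname, port):
    """Return the distinct IP addresses behind a (round-robin) DNS name."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})

def probe(address, port, timeout=2.0):
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_servers(hostname, port, timeout=2.0):
    """Map each address behind hostname to its reachability."""
    return {addr: probe(addr, port, timeout)
            for addr in resolve_all(hostname, port)}

# Example call (hypothetical hostname, assumed port):
#   check_servers("gluster.example.com", 6996)
```

If one address reports False while the other reports True, a client that
resolves the name to the dead address would have to retry against the
other record before it could reconnect.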
Last night, glusterfsd on one of the servers crashed (with a core dump),
and instead of the remaining server being used automatically, the entire
cluster became unusable. The logs on both the surviving server and the
clients are littered with tens of thousands of error messages, and the
mounted shares were not accessible.
It is (was?) my understanding that Gluster tolerates faults in which
one of the nodes becomes inaccessible. Is this the case or not?
Particulars...
Both servers :
[root@server glusterfs]# uname -s -r -o -i
Linux 2.6.25.10-86.fc9.i686 i386 GNU/Linux
[root@server glusterfs]# cat /etc/redhat-release
Fedora release 9 (Sulphur)
GLUSTER CONFIG : http://glusterfs.pastebin.com/m45feb982
Both clients :
[root@client glusterfs]# uname -s -r -o -i
Linux 2.6.24.4 x86_64 GNU/Linux
[root@client glusterfs]# cat /etc/redhat-release
Fedora release 8 (Werewolf)
GLUSTER CONFIG : http://glusterfs.pastebin.com/m48b7dd28
LOGS FROM THE INCIDENT : http://glusterfs.pastebin.com/m72cbc8f5
(excerpts from all four machines)
(note the following from the server that crashed...)
[0x110400]
/usr/lib/libglusterfs.so.0(dict_del+0x2d)[0x808e7d]
/usr/lib/glusterfs/1.3.12/xlator/protocol/client.so(notify+0x21b)[0x126a4b]
/usr/lib/libglusterfs.so.0(transport_notify+0x3d)[0x81374d]
/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xf9)[0x814779]
/usr/lib/libglusterfs.so.0(poll_iteration+0xa0)[0x8138f0]
[glusterfs](main+0x786)[0x804a156]
/lib/libc.so.6(__libc_start_main+0xe6)[0xb655d6]
[glusterfs][0x8049431]
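To make the trace above easier to read, here is a rough sketch that
splits each frame into module, symbol, offset and address. The regular
expression is written only against the frame shapes shown in this trace;
it is an assumption that other backtraces follow the same layout.

```python
import re

# module (optional), "(symbol+offset)" (optional), then "[address]" at the end
FRAME = re.compile(
    r"^(?P<module>.*?)"
    r"(?:\((?P<symbol>\w+)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)

def parse_frame(line):
    """Parse one backtrace line; return a dict, or None if it does not match."""
    match = FRAME.match(line.strip())
    if match is None:
        return None
    frame = match.groupdict()
    frame["module"] = frame["module"] or None  # empty prefix means no module
    return frame

def named_frames(lines):
    """Return (module, symbol) for every frame that carries a symbol name."""
    frames = (parse_frame(line) for line in lines)
    return [(f["module"], f["symbol"]) for f in frames if f and f["symbol"]]
```

Applied to the trace above, the first named frame is dict_del in
libglusterfs.so.0, reached from notify in the protocol/client translator.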
---------
What could have caused Gluster to crash? Should the cluster have
continued to function or not? What, if anything, can be done to
prevent this from happening in the future?
Thank you, all.
--
Daniel Maher <dma+gluster AT witbe DOT net>