[Gluster-users] Gluster crashes when cascading AFR

Mon Dec 15 15:03:51 UTC 2008

Hello all,

I am trying to set up a file replication scheme for our cluster of about
2000 nodes. I'm not sure whether what I am doing is actually feasible,
so I had best start from the beginning; maybe one of you knows an even
better way to do this.

Basically we have a farm of 2000 machines running an application that,
during startup, reads about 300 MB of data (out of a 6 GB repository) of
program libraries, geometry data, etc. It does this 8 times per node,
once per core on every machine. The program does not modify the data, so
it can be regarded as read-only. The application is launched on all
nodes simultaneously, and especially now during debugging this happens
very often (relaunches are only minutes apart).
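
To put rough numbers on that startup load, here is a small Python sketch
of the arithmetic. The node count, instances per node and read size come
from the description above; the "per-node cache" case is only an
assumption (that one read per machine could serve the other 7 instances)
and is not something measured on the farm. Sizes use decimal units
(1 TB = 1e6 MB).

```python
# Back-of-the-envelope estimate of the data read during one application launch.
NODES = 2000          # machines in the farm (from the description above)
PROCS_PER_NODE = 8    # one application instance per core
READ_MB = 300         # data read by each instance during startup


def startup_read_mb(nodes, procs_per_node, read_mb, cached_per_node=False):
    """Return the total MB requested from the server for one launch.

    cached_per_node=True is a hypothetical case: it assumes the first read
    on a machine fills a cache that serves the other instances there.
    """
    readers = nodes if cached_per_node else nodes * procs_per_node
    return readers * read_mb


worst_case_mb = startup_read_mb(NODES, PROCS_PER_NODE, READ_MB)
best_case_mb = startup_read_mb(NODES, PROCS_PER_NODE, READ_MB,
                               cached_per_node=True)

print(f"no caching:     {worst_case_mb / 1e6:.1f} TB per launch")
print(f"per-node cache: {best_case_mb / 1e6:.1f} TB per launch")
```

Either way a single central server has to deliver hundreds of GB within
a short window at every launch, which is the motivation for moving the
data onto the local disks.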

Up until now we have been using NFS to export the software from a
central server, but we are not really happy with it. The performance is
lousy, NFS keeps crashing the repository server, and when it does not
crash, it regularly claims that random files do not exist, although they
are obviously there because all the other nodes start up just fine.

Fortunately each of the nodes came with a hard drive which is not used
right now, so I thought about replicating the software to each node and
just reading everything from there -> enter GlusterFS.

The setup I have created now looks like this:

                repository server
                    (Node A) 
               (AFR to nodes below)
                      /|\
                     / | \
                    /  |  \
                ----   |   ----
              /        |        \
             /         |         \
        (Node B)     ..... 50 x Node B
  (AFR to nodes below)
           /|\
          / | \
         /  |  \
   (Node C) .... 40 x Node C
       |
       |
  Local FS on
 Compute Nodes

The idea is that the repository server mounts a GlusterFS volume that is
AFRed to the 50 nodes of type B, and each type B node in turn AFRs to
the 40 compute nodes below it. This way, whenever a new release of the
application is available, it just needs to be copied to the GlusterFS
mount on the repository server and should be replicated to the 2000
compute nodes. If any part of the new release needs a "hotfix", it can
be modified directly on the repository server, without having to roll
out the whole application to the compute nodes again. Since the
application does not modify the data, it does not interfere with the
file replication scheme and I can just read it from the locally mounted
FS on the compute nodes.
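
As a rough illustration of why the tree has two levels, the following
Python sketch works out the fan-out and the outbound traffic per node
for one full 6 GB release. It rests on an assumption that is not
confirmed anywhere in this mail: that the node running the AFR
translator sends one full copy of every write to each of its remote
subvolumes. The "flat" figure is a hypothetical comparison in which
Node A would replicate to all compute nodes directly.

```python
# Fan-out of the two-level AFR tree described above.
FANOUT_A = 50     # type B nodes below the repository server (Node A)
FANOUT_B = 40     # compute nodes (type C) below each type B node
RELEASE_GB = 6    # size of the full software repository


def outbound_gb(remote_copies, size_gb):
    """GB sent by one node, ASSUMING one full copy per remote subvolume."""
    return remote_copies * size_gb


compute_nodes = FANOUT_A * FANOUT_B
outbound_a_gb = outbound_gb(FANOUT_A, RELEASE_GB)        # Node A -> 50 x B
outbound_b_gb = outbound_gb(FANOUT_B, RELEASE_GB)        # each B -> 40 x C
outbound_flat_gb = outbound_gb(compute_nodes, RELEASE_GB)  # hypothetical: A -> all C

print(f"compute nodes reached:  {compute_nodes}")
print(f"Node A sends:           {outbound_a_gb} GB per full release")
print(f"each Node B sends:      {outbound_b_gb} GB per full release")
print(f"flat layout, Node A:    {outbound_flat_gb} GB per full release")
```

Under that assumption no single machine has to push more than a few
hundred GB per release, compared with 12000 GB from Node A in the flat
layout.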

Now here is the problem:
Whenever I start the GlusterFS client on Node A, all GlusterFS processes
on the nodes of type B crash with signal 11 (segmentation fault) inside
afr_setxattr. Log of the crash:

> 2008-12-15 11:32:28 D [tcp-server.c:145:tcp_server_notify] server: Registering socket (7) for new transport object of 10.128.2.2
> 2008-12-15 11:32:28 D [ip.c:120:gf_auth] head: allowed = "*", received ip addr = "10.128.2.2"
> 2008-12-15 11:32:28 D [server-protocol.c:5674:mop_setvolume] server: accepted client from 10.128.2.2:1023
> 2008-12-15 11:32:28 D [server-protocol.c:5717:mop_setvolume] server: creating inode table with lru_limit=1024, xlator=head
> 2008-12-15 11:32:28 D [inode.c:1163:inode_table_new] head: creating new inode table with lru_limit=1024, sizeof(inode_t)=156
> 2008-12-15 11:32:28 D [inode.c:577:__create_inode] head/inode: create inode(1)
> 2008-12-15 11:32:28 D [inode.c:367:__active_inode] head/inode: activating inode(1), lru=0/1024
> 2008-12-15 11:32:28 D [afr.c:950:afr_setxattr] head: AFRDEBUG:loc->path = /
> 
> TLA Repo Revision: glusterfs--mainline--2.5--patch-797
> Time : 2008-12-15 11:32:28
> Signal Number : 11
> 
> glusterfsd -f /home/rainer/sources/gluster/sw-farmctl.hlta01.vol -l /var/log/glusterfs/glusterfsd.log -L DEBUG
> volume server
>   type protocol/server
>   option auth.ip.head.allow *
>   option transport-type tcp/server
>   subvolumes head
> end-volume
> 
> volume head
>   type cluster/afr
>   option debug on
>   subvolumes local-brick hlta0101-client
> end-volume
> 
> volume hlta0101-client
>   type protocol/client
>   option remote-subvolume sw-brick
>   option remote-host hlta0101
>   option transport-type tcp/client
> end-volume
> 
> volume local-brick
>   type storage/posix
>   option directory /localdisk/gluster/sw
> end-volume
> frame : type(1) op(19)
> 
> /lib64/tls/libc.so.6[0x3d09b2e300]
> /lib64/tls/libc.so.6(memcpy+0x60)[0x3d09b725b0]
> /usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so(afr_setxattr+0x207)[0x2a95797e57]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_setxattr_resume+0xc6)[0x2a958b7846]
> /usr/lib64/libglusterfs.so.0(call_resume+0xf58)[0x3887f16af8]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_setxattr+0x2b1)[0x2a958b7b21]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_protocol_interpret+0x2f5)[0x2a958bc985]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(notify+0xef)[0x2a958bd5ef]
> /usr/lib64/libglusterfs.so.0(sys_epoll_iteration+0xe1)[0x3887f113c1]
> /usr/lib64/libglusterfs.so.0(poll_iteration+0x4a)[0x3887f10b0a]
> [glusterfs](main+0x418)[0x402658]
> /lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x3d09b1c40b]
> [glusterfs][0x401d8a]
> ---------

Just for testing I reduced the setup from a tree to a chain of the form
(Node A) AFR -> (Node B) AFR -> (Node C) 
with just 3 servers, but it does not run even in this configuration.
When I take out one of the AFRs and put in a plain tcp/client instead,
it works. Interestingly enough, when GlusterFS is already running on
Node A and I then start GlusterFS on Node B, everything works the way I
would like it to. But I can't really restart everything on level B
whenever a new client is added to that FS.
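
For anyone who wants to reproduce the chain, here is a small Python
sketch that renders a Node B volume file with the same volumes and
options as the dump in the crash log above. The helper names (`volume`,
`node_b_volfile`) are made up for this example. The stanzas are emitted
bottom-up on the assumption that a volume has to be defined before
another volume lists it under `subvolumes`; the dump in the log prints
them in the opposite order. Passing more than one child host is a
hypothetical extension towards the 40-way fan-out and has not been
tested.

```python
def volume(name, vtype, options=(), subvolumes=()):
    """Render one 'volume ... end-volume' stanza as text."""
    lines = [f"volume {name}", f"  type {vtype}"]
    lines += [f"  option {key} {value}" for key, value in options]
    if subvolumes:
        lines.append("  subvolumes " + " ".join(subvolumes))
    lines.append("end-volume")
    return "\n".join(lines)


def node_b_volfile(children, directory="/localdisk/gluster/sw"):
    """Render a Node B volfile: local brick + one client per child, under AFR."""
    clients = [f"{host}-client" for host in children]
    stanzas = [volume("local-brick", "storage/posix",
                      [("directory", directory)])]
    for host, client in zip(children, clients):
        stanzas.append(volume(client, "protocol/client", [
            ("remote-subvolume", "sw-brick"),
            ("remote-host", host),
            ("transport-type", "tcp/client"),
        ]))
    # AFR replicates across the local brick and every child client.
    stanzas.append(volume("head", "cluster/afr", [("debug", "on")],
                          ["local-brick"] + clients))
    # Export the AFR volume to the level above (Node A).
    stanzas.append(volume("server", "protocol/server", [
        ("auth.ip.head.allow", "*"),
        ("transport-type", "tcp/server"),
    ], ["head"]))
    return "\n\n".join(stanzas) + "\n"


volfile = node_b_volfile(["hlta0101"])  # same single child as in the log
print(volfile)
```

With the single child `hlta0101` this produces the four volumes shown in
the log (`local-brick`, `hlta0101-client`, `head`, `server`).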

If you have kept reading until here, thanks for your attention and 
sorry for the long-winded mail.

Here are the technical details of the software involved:
GlusterFS: version 1.3.12 on all nodes.
OS       : Linux 2.6.9-78.0.1 on Intel x86_64
Fuse     : version 2.6.3-2
The interconnect between nodes is TCP/IP over Ethernet.

Thanks for your help and cheers,

  Rainer

*****************************************************
* Rainer Schwemmer                                  *
*                                                   *
* PH Division, CERN                                 *
* LHCb Experiment                                   *
* CH-1211 Geneva 23                                 *
* Telephone: [41] 22 767 31 25                      *
* Fax:       [41] 22 767 94 25                      *
* E-mail:    mailto:rainer.schwemmer at NOSPAM.cern.ch *
*****************************************************