[Gluster-devel] AFR: machine crash hangs other mounts or transport endpoint not connected

Wed Apr 30 13:35:08 UTC 2008

Gluster devs, 

I am still not able to keep the client from hanging on a diskless cluster
node. When I fail a server, the client becomes unresponsive and does not read
from the other AFR volume. To rule out the simple loss of a stray binary or
library, I first moved the entire /lib, /bin and /sbin directories into the
ramdisk that the nodes run from. Running lsof | grep gluster on the client
(before the failover test) shows:

[root@node1 ~]# lsof |grep gluster
glusterfs 2195   root  cwd       DIR        0,1        0         2 /
glusterfs 2195   root  rtd       DIR        0,1        0         2 /
glusterfs 2195   root  txt       REG        0,1    55592      3863 /bin/glusterfs
glusterfs 2195   root  mem       REG        0,1   341068      2392 /lib/libfuse.so.2.7.2
glusterfs 2195   root  mem       REG        0,1   118096      2505 /lib/glusterfs/1.3.8pre6/xlator/mount/fuse.so
glusterfs 2195   root  mem       REG        0,1   164703      2514 /lib/glusterfs/1.3.8pre6/xlator/protocol/client.so
glusterfs 2195   root  mem       REG        0,1   112168        77 /lib/ld-2.3.4.so
glusterfs 2195   root  mem       REG        0,1  1529120      2483 /lib/tls/libc-2.3.4.so
glusterfs 2195   root  mem       REG        0,1    16732        70 /lib/libdl-2.3.4.so
glusterfs 2195   root  mem       REG        0,1   107800      2485 /lib/tls/libpthread-2.3.4.so
glusterfs 2195   root  mem       REG        0,1    43645      2533 /lib/glusterfs/1.3.8pre6/transport/tcp/client.so
glusterfs 2195   root  mem       REG        0,1   427763      2456 /lib/libglusterfs.so.0.0.0
glusterfs 2195   root  mem       REG        0,1    50672      2474 /lib/tls/librt-2.3.4.so
glusterfs 2195   root  mem       REG        0,1   245686      2522 /lib/glusterfs/1.3.8pre6/xlator/cluster/afr.so
glusterfs 2195   root    0u      CHR        1,3               3393 /dev/null
glusterfs 2195   root    1u      CHR        1,3               3393 /dev/null
glusterfs 2195   root    2u      CHR        1,3               3393 /dev/null
glusterfs 2195   root    3w      REG        0,1      102      4495 /var/log/glusterfs/glusterfs.log
glusterfs 2195   root    4u      CHR     10,229               3494 /dev/fuse
glusterfs 2195   root    5r     0000        0,8        0      4498 eventpoll
glusterfs 2195   root    6u     IPv4       4499                TCP 192.168.20.155:1023->master1:6996 (ESTABLISHED)
glusterfs 2195   root    7u     IPv4       4500                TCP 192.168.20.155:1022->master2:6996 (ESTABLISHED)

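The DEVICE column in the listing (0,1 for every regular file and directory)
is what ties each entry to the ramdisk. A rough way to check that
mechanically is sketched below in Python. It is only an illustration, not
anything glusterfs provides: the name entries_off_device is made up here, and
it assumes each lsof entry sits on a single line (wrapped lines must be
rejoined first), that paths contain no spaces, and that 0,1 is the ramdisk
device.

```python
def entries_off_device(lsof_text, device="0,1"):
    """Return paths of REG/DIR entries whose DEVICE column differs from `device`.

    Assumption: lsof's default column order
    (COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME), one entry per line,
    and no spaces in path names.
    """
    off_device = []
    for line in lsof_text.splitlines():
        fields = line.split()
        # Only regular files and directories carry all nine columns;
        # CHR and socket rows are skipped here.
        if len(fields) < 9 or fields[4] not in ("REG", "DIR"):
            continue
        if fields[5] != device:
            off_device.append(fields[-1])
    return off_device
```

An empty result for the glusterfs process would support the idea that no
open regular file or directory lives outside the ramdisk. It says nothing
about files the process opens only at failover time.
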
Apart from the two TCP connections to the servers, everything listed here is
a local file, and the glusterfs binary has access to all of them during
failover. Can you help me troubleshoot by explaining exactly what glusterfs
does when it loses a connection? Does it depend on something I have missed?
This failover test uses the same config files and binaries as my earlier
tests (which succeeded, but were not run on a diskless node). Is there
something else in the filesystem that glusterfs requires in order to fail
over successfully?
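
To put a number on the hang, a listing of the mount point can be timed with
an upper bound, so that the roughly 10 second stall Krishna describes in the
quoted message can be told apart from a client that never comes back. This is
only a sketch in Python: timed_listing and the /mnt/glusterfs path are
placeholders of mine, not part of glusterfs, and a process stuck inside a
hung FUSE mount may not die when killed, so the helper does not wait for it.

```python
import subprocess
import time


def timed_listing(mount_point, timeout_s=30.0):
    """Run `ls` on mount_point and report ("returned" | "hung", seconds elapsed)."""
    start = time.monotonic()
    proc = subprocess.Popen(
        ["ls", mount_point],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        outcome = "returned"
    except subprocess.TimeoutExpired:
        # Best effort only: the child may be blocked in the kernel, so we
        # send the kill and deliberately do not wait for it to exit.
        proc.kill()
        outcome = "hung"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical mount point; substitute the real client mount.
    print(timed_listing("/mnt/glusterfs", timeout_s=30.0))
```

Run once before failing a server and once after: a "returned" result after
about the transport timeout would match a working failover, while "hung"
would match what I am seeing.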

Thanks, 
Chris

> > > > Gerry, Christopher,
> > > > 
> > > > Here is what I tried to do. Two servers, one client, simple setup,
> > > > AFR on the client side. I did "ls" on the client mount point and it
> > > > works. Then I did "ifconfig eth0 down" on the server and ran "ls" on
> > > > the client again: it hangs for 10 secs (the timeout value), then
> > > > fails over and starts working again without any problem.
> > > > 
> > > > I guess a few users are facing the problem you guys are facing.
> > > > Can you give us your setup details and mention the exact steps to
> > > > reproduce? Also try to come up with a minimal config that can
> > > > still reproduce the problem.
> > > > 
> > > > Thanks!
> > > > Krishna
> > > > 
> > > > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins 
> > > > <chawkins at veracitynetworks.com> wrote:
> > > > > I am having the same issue. I'm working on a diskless node cluster
> > > > > and figured the issue was related to that, since AFR seems to fail
> > > > > over nicely for everyone else...
> > > > >  But it seems I am not alone, so what can I do to help troubleshoot?
> > > > >
> > > > >  I have two servers exporting a brick each, and a client mounting
> > > > > them both with AFR and no unify. Transport timeout settings don't
> > > > > seem to make a difference - the client is just hung if I power off
> > > > > a server or just stop glusterfsd. There is nothing logged on the
> > > > > server side.
> > > > >  I'll use a USB thumb drive for client-side logging, since any logs
> > > > > in the ramdisk obviously disappear after the reboot which fixes the
> > > > > hang...
> > > > >  If I get any insight from this I'll report it asap.
> > > > >
> > > > >  Thanks,
> > > > >  Chris
> > > > >
> > > > >
> > > > >
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at nongnu.org
> > > http://lists.nongnu.org/mailman/listinfo/gluster-devel