[Gluster-devel] AFR: machine crash hangs other mounts or transport endpoint not connected

Thu May 8 10:35:23 UTC 2008

Chris,
Do you see clues in the log files?
Krishna

On Wed, Apr 30, 2008 at 8:22 PM, Anand Avati <avati at zresearch.com> wrote:
> Chris,
>   can you get the glusterfs client logs from your ramdisk from the time the
>  servers are pulled out and you try to access the mount point?
>
>
>
>  avati
>
>  2008/4/30 Christopher Hawkins <chawkins at veracitynetworks.com>:
>
>  > Without. All that is removed...
>  >
>  >
>  > From: anand.avati at gmail.com [mailto:anand.avati at gmail.com] On Behalf Of
>  > Anand Avati
>  > Sent: Wednesday, April 30, 2008 10:24 AM
>  > To: Christopher Hawkins
>  > Cc: gluster-devel at nongnu.org
>  > Subject: Re: [Gluster-devel] AFR: machine crash hangs other
>  > mounts or transport endpoint not connected
>  >
>  >
>  > Chris,
>  >  is this hang with IP failover in place or without?
>  >
>  > avati
>  >
>  >
>  > 2008/4/30 Christopher Hawkins <chawkins at veracitynetworks.com>:
>  >
>  >
>  >
>  > Gluster devs,
>  >
>  > I am still not able to keep the client from hanging in a diskless cluster
>  > node. When I fail a server the client becomes unresponsive and does not
>  > read from the other AFR volume. I first moved the entire /lib, /bin and
>  > /sbin directories into the ramdisk which runs the nodes to rule out the
>  > simple loss of an odd binary or library... An lsof | grep gluster on the
>  > client (pre-failover test) shows:
>  >
>  > [root at node1 ~]# lsof |grep gluster
>  > glusterfs 2195   root  cwd       DIR        0,1        0         2 /
>  > glusterfs 2195   root  rtd       DIR        0,1        0         2 /
>  > glusterfs 2195   root  txt       REG        0,1    55592      3863 /bin/glusterfs
>  > glusterfs 2195   root  mem       REG        0,1   341068      2392 /lib/libfuse.so.2.7.2
>  > glusterfs 2195   root  mem       REG        0,1   118096      2505 /lib/glusterfs/1.3.8pre6/xlator/mount/fuse.so
>  > glusterfs 2195   root  mem       REG        0,1   164703      2514 /lib/glusterfs/1.3.8pre6/xlator/protocol/client.so
>  > glusterfs 2195   root  mem       REG        0,1   112168        77 /lib/ld-2.3.4.so
>  > glusterfs 2195   root  mem       REG        0,1  1529120      2483 /lib/tls/libc-2.3.4.so
>  > glusterfs 2195   root  mem       REG        0,1    16732        70 /lib/libdl-2.3.4.so
>  > glusterfs 2195   root  mem       REG        0,1   107800      2485 /lib/tls/libpthread-2.3.4.so
>  > glusterfs 2195   root  mem       REG        0,1    43645      2533 /lib/glusterfs/1.3.8pre6/transport/tcp/client.so
>  > glusterfs 2195   root  mem       REG        0,1   427763      2456 /lib/libglusterfs.so.0.0.0
>  > glusterfs 2195   root  mem       REG        0,1    50672      2474 /lib/tls/librt-2.3.4.so
>  > glusterfs 2195   root  mem       REG        0,1   245686      2522 /lib/glusterfs/1.3.8pre6/xlator/cluster/afr.so
>  > glusterfs 2195   root    0u      CHR        1,3               3393 /dev/null
>  > glusterfs 2195   root    1u      CHR        1,3               3393 /dev/null
>  > glusterfs 2195   root    2u      CHR        1,3               3393 /dev/null
>  > glusterfs 2195   root    3w      REG        0,1      102      4495 /var/log/glusterfs/glusterfs.log
>  > glusterfs 2195   root    4u      CHR     10,229               3494 /dev/fuse
>  > glusterfs 2195   root    5r     0000        0,8        0      4498 eventpoll
>  > glusterfs 2195   root    6u     IPv4       4499                TCP 192.168.20.155:1023->master1:6996 (ESTABLISHED)
>  > glusterfs 2195   root    7u     IPv4       4500                TCP 192.168.20.155:1022->master2:6996 (ESTABLISHED)
>  >
>  > Everything listed here is a local file, and the gluster binary has access
>  > to all of them during failover. Can you help me troubleshoot by explaining
>  > what exactly gluster is doing when it loses a connection? Does it depend
>  > on something I have missed? This failover test uses the same config files
>  > and binaries that my earlier tests used (which succeeded, but were not run
>  > on a diskless node). Is there something else in the filesystem that
>  > glusterfs requires to fail over successfully?
>  >
>  > Thanks,
>  > Chris
>  >
>  >
>  > > > >
>  > > > >
>  > > > > > Gerry, Christopher,
>  > > > > >
>  > > > > > Here is what I tried: two servers, one client, a simple setup
>  > > > > > with afr on the client side. I ran "ls" on the client mount
>  > > > > > point and it worked. I then ran "ifconfig eth0 down" on the
>  > > > > > server. The next "ls" on the client hung for 10 secs (the
>  > > > > > timeout value), then failed over and started working again
>  > > > > > without any problem.
>  > > > > >
>  > > > > > I guess a few users are facing the same problem you are.
>  > > > > > Can you give us your setup details and the exact steps to
>  > > > > > reproduce it? Also try to come up with a minimal config
>  > > > > > that still reproduces the problem.
>  > > > > >
>  > > > > > Thanks!
>  > > > > > Krishna
>  > > > > >
>  > > > > > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins
>  > > > > > <chawkins at veracitynetworks.com> wrote:
>  > > > > > > I am having the same issue. I'm working on a diskless node
>  > > > > > > cluster and figured the issue was related to that, since AFR
>  > > > > > > seems to fail over nicely for everyone else...
>  > > > > > >  But it seems I am not alone, so what can I do to help
>  > > > > > > troubleshoot?
>  > > > > > >
>  > > > > > >  I have two servers exporting a brick each, and a client
>  > > > > > > mounting them both with AFR and no unify. Transport timeout
>  > > > > > > settings don't seem to make a difference - the client is just
>  > > > > > > hung if I power off the server or just stop glusterfsd. There
>  > > > > > > is nothing logged on the server side.
>  > > > > > >  I'll use a usb thumb drive for client side logging since any
>  > > > > > > logs in the ramdisk obviously disappear after the reboot which
>  > > > > > > fixes the hang...
>  > > > > > >  If I get any insight from this I'll report it asap.
>  > > > > > >
>  > > > > > >  Thanks,
>  > > > > > >  Chris
>  > > > > > >
>  > > > > > >
>  > > > > > >
>  > > > >
>  > > > >
>  > > > >
>  > > > > _______________________________________________
>  > > > > Gluster-devel mailing list
>  > > > > Gluster-devel at nongnu.org
>  > > > > http://lists.nongnu.org/mailman/listinfo/gluster-devel
>  > > > >
>  >
>  > --
>  > If I traveled to the end of the rainbow
>  > As Dame Fortune did intend,
>  > Murphy would be there to tell me
>  > The pot's at the other end.
>
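
The failover check described in the thread (run "ls" on the client mount
point, take a server down, run "ls" again and see whether it comes back after
the timeout) can be timed with a small script. This is only a minimal sketch,
assuming Python 3 is available on the client; the function name, the
30-second limit and the example mount path are placeholders and do not come
from the thread.

```python
import subprocess
import time


def time_listing(path, limit_secs=30.0):
    """Run `ls` on path in a child process and time it.

    Returns (status, elapsed_secs) where status is one of:
      "ok"    - ls exited with status 0
      "error" - ls exited with a non-zero status
      "hung"  - ls did not finish within limit_secs
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        ["ls", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=limit_secs)
        status = "ok" if returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a hung mount may not
        # die right away, so we deliberately do not wait for it here.
        proc.kill()
        status = "hung"
    return status, time.monotonic() - start
```

Calling time_listing("/mnt/glusterfs") (a hypothetical mount path) before and
after taking a server down would show whether the listing returns after
roughly the transport timeout or never returns at all. Run from a script on
the ramdisk and pointed at persistent storage for its output, it could be
combined with the client-side logging discussed above.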