[Gluster-devel] AFR: machine crash hangs other mounts or transport endpoint not connected
Christopher Hawkins
chawkins at veracitynetworks.com
Tue Apr 29 12:44:57 UTC 2008
Thanks Krishna! I moved the setup out of the diskless boot cluster and am
able to reproduce on regular machines. All version and config information is
below. The scenario is:
1. Start glusterfsd on both servers
2. On client, mount gluster at /mnt/gluster
3. Run a little testing script on the client to show the status of the mount:
#!/bin/bash
while true
do
echo $(date)
sleep 1
cat /mnt/gluster/etc/fstab
done
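The loop above shows whether reads succeed, but not how long each one takes. A small helper like this (just a sketch; the `timed_read` name and the fstab path are illustrative, not from the original test) makes a hang visible as a growing latency figure:

```shell
#!/bin/bash
# timed_read FILE: read FILE once, discarding the data, and report how many
# wall-clock seconds the read took. On a healthy AFR mount this should be
# 0s; during a transport hang it grows until the read returns or is killed.
timed_read() {
    local start end
    start=$(date +%s)
    cat "$1" > /dev/null
    end=$(date +%s)
    echo "$(date): read $1 in $((end - start))s"
}

# Poll the mount once per second, as in the loop above (Ctrl-C to stop):
# while true; do timed_read /mnt/gluster/etc/fstab; sleep 1; done
```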
4-A. ifconfig down on server1 - client logs no errors, no delays (must be
reading from server2)
4-B. ifconfig down on server2 - first time I tried = recovery in 5 seconds;
2nd and 3rd times = client hangs until I manually kill the process
4-C. Hard power off on server1 - client logs no errors, no delays
4-D. Hard power off on server2 - client hangs until I manually kill the
process
The client logs the following in situation 4-B during a hang:
2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master1: activating
bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received
= 2008-04-29 08:36:40 transport-timeout = 5
2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master1: bailing
transport
2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master2: activating
bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received
= 2008-04-29 08:36:40 transport-timeout = 5
2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master2: bailing
transport
2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup]
master2: cleaning up state in transport object 0x8e11968
2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup]
master2: forced unwinding frame type(1) op(34) reply=@0x8e4a8e0
2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no
proper reply from server, returning ENOTCONN
2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup]
master1: cleaning up state in transport object 0x8e103b8
2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup]
master1: forced unwinding frame type(1) op(34) reply=@0x8e4a9b0
2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master1: no
proper reply from server, returning ENOTCONN
2008-04-29 08:36:50 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse:
362: (34) /etc => -1 (107)
2008-04-29 08:36:50 E [client-protocol.c:324:client_protocol_xfer] master1:
transport_submit failed
2008-04-29 08:36:50 W [client-protocol.c:331:client_protocol_xfer] master2:
not connected at the moment to submit frame type(1) op(34)
2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no
proper reply from server, returning ENOTCONN
2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master2: non-blocking
connect() returned: 113 (No route to host)
2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master1: non-blocking
connect() returned: 113 (No route to host)
2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master2: non-blocking
connect() returned: 113 (No route to host)
2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master1: non-blocking
connect() returned: 113 (No route to host)
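For what it's worth, a quick way to line these events up against the hang window is to count the bail-out and reconnect-failure lines per second in the client log (a sketch only; the log format is taken from the excerpt above, and the function name is made up):

```shell
#!/bin/bash
# summarize_log LOGFILE: count call_bail and tcp_connect lines per
# timestamp, so the moment the transports bailed out stands out.
summarize_log() {
    grep -E 'call_bail|tcp_connect' "$1" \
        | awk '{ count[$1 " " $2]++ } END { for (t in count) print t, count[t] }' \
        | sort
}
```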
Version:
glusterfs 1.3.8pre6 built on Apr 28 2008 21:20:10
Repository revision: glusterfs--mainline--2.5--patch-748
Config file on server1 and server2:
-------------------------
volume storage1
type storage/posix # POSIX FS translator
option directory / # Export this directory
end-volume
#
volume brick-ns
type storage/posix
option directory /ns
end-volume
#
volume server
type protocol/server
option transport-type tcp/server # For TCP/IP transport
subvolumes storage1
  option auth.ip.storage1.allow 192.168.20.* # Allow access to "storage1" volume
end-volume
-------------------------
Config file on client:
-------------------------
volume master1
type protocol/client
option transport-type tcp/client # for TCP/IP transport
option remote-host 192.168.20.140 # IP address of the remote brick
option transport-timeout 5
option remote-subvolume storage1 # name of the remote volume
end-volume
volume master2
type protocol/client
option transport-type tcp/client # for TCP/IP transport
option remote-host 192.168.20.141 # IP address of the remote brick
option transport-timeout 5
option remote-subvolume storage1 # name of the remote volume
end-volume
volume data-afr
type cluster/afr
subvolumes master1 master2
end-volume
-------------------------
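One experiment that might help isolate the problem: the 5-second transport-timeout above is quite aggressive. Raising it on the client side (the value 30 below is just an example, not a recommendation) would show whether the hang is a timeout race or a genuine failure to detect the dead server:
-------------------------
volume master1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.20.140
  option transport-timeout 30      # raised from 5 as an experiment
  option remote-subvolume storage1
end-volume
-------------------------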
> Gerry, Christopher,
>
> Here is what I tried to do. Two servers, one client, simple
> setup, afr on the client side. I did "ls" on client mount
> point, it works, now I do "ifconfig eth0 down"
> on the server, next I do "ls" on client, it hangs for 10 secs
> (timeout value) and fails over and starts working again
> without any problem.
>
> I guess only a few users are facing the problem you guys are
> facing. Can you give us your setup details and the exact
> steps to reproduce? Also try to come up with a minimal
> config that can still reproduce the problem.
>
> Thanks!
> Krishna
>
> On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins
> <chawkins at veracitynetworks.com> wrote:
> > I am having the same issue. I'm working on a diskless node cluster
> > and figured the issue was related to that since AFR seems to fail
> > over nicely for everyone else...
> > But it seems I am not alone, so what can I do to help troubleshoot?
> >
> > I have two servers exporting a brick each, and a client mounting
> > them both with AFR and no unify. Transport timeout settings don't
> > seem to make a difference - client is just hung if I power off or
> > just stop glusterfsd. There is nothing logged on the server side.
> > I'll use a USB thumb drive for client side logging, since any logs
> > in the ramdisk obviously disappear after the reboot which fixes the
> > hang... If I get any insight from this I'll report it asap.
> >
> > Thanks,
> > Chris
> >
> >
> >