[Gluster-users] The continuing story ...
avati at gluster.com
Sat Sep 19 09:39:57 UTC 2009
> [root at wcarh033]~# ps -ef | grep gluster
> root 1548 1 0 21:00 ? 00:00:00
> /opt/glusterfs/sbin/glusterfsd -f /etc/glusterfs/glusterfsd.vol
> root 1861 1 0 21:00 ? 00:00:00
> /opt/glusterfs/sbin/glusterfs --log-level=NORMAL
> --volfile=/etc/glusterfs/tools.vol /gluster/tools
> root 1874 1861 0 21:00 ? 00:00:00 /bin/mount -i -f -t
> fuse.glusterfs -o rw,allow_other,default_permissions,max_read=131072
> /etc/glusterfs/tools.vol /gluster/tools
> root 2426 2395 0 21:02 pts/2 00:00:00 grep gluster
> [root at wcarh033]~# ls /gluster/tools
> Yep - all three nodes locked up. All it took was a simultaneous reboot
> of all three machines.
> After I kill -9 1874 (kill 1874 without -9 has no effect) from a
> different ssh session, I get:
> ls: cannot access /gluster/tools: Transport endpoint is not connected
> After this, mount works (unmount is not necessary, it turns out).
> I am unable to strace -p the mount -t fuse process without it freezing
> up. I can run pstack against it, but it returns 0 lines of output
> fairly quickly.
> The symptoms are identical on all three machines. The setup is 3-way
> replication: each machine runs both a server exposing one volume and a
> client using cluster/replicate, with a preferred read of the local
> server.
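A minimal client-side volfile for the setup described above might look like the sketch below. This is illustrative only: the volume names, the `remote2`/`remote3` hostnames, and the brick name are assumptions, not taken from the poster's actual configuration (only `wcarh033` appears in the report).

```
# tools.vol -- hypothetical sketch of a 3-way replicate client
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host wcarh033        # the local node
  option remote-subvolume brick      # brick name is assumed
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host wcarh034        # hostname assumed
  option remote-subvolume brick
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host wcarh035        # hostname assumed
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes remote1 remote2 remote3
  option read-subvolume remote1      # prefer reads from the local server
end-volume
```

The `read-subvolume` option is how cluster/replicate expresses the "preferred read of the local server" the poster describes; on each node it would point at that node's own protocol/client volume.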
This is a strange hang. I have a few more questions -
1. Is this off the glusterfs.git master branch or release-2.0? If it is master, there have been heavy, un-QA'ed modifications to get rid of the libfuse dependency.
2. What happens if you start the three daemons together now, while the system is not booting? Is this hang somehow related to system boot?
3. Can you provide dmesg output and trace-level glusterfs logs of this scenario?
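One way to capture that information, as a sketch: the binary and volfile paths below are taken from the ps output earlier in the thread, but the log-file location is arbitrary, and TRACE may not be available on every build of this era (DEBUG is the fallback).

```shell
# capture kernel messages first
dmesg > /tmp/dmesg.txt

# remount the client with verbose logging to a known file
# (use --log-level=DEBUG if this build does not accept TRACE)
/opt/glusterfs/sbin/glusterfs --log-level=TRACE \
    --log-file=/tmp/glusterfs-tools-trace.log \
    --volfile=/etc/glusterfs/tools.vol /gluster/tools
```

Reproducing the hang with logging already at trace level, rather than raising it afterwards, is what makes the resulting log useful for this kind of mount-time deadlock.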