[Gluster-users] The continuing story ...
Mark Mielke
mark at mark.mielke.cc
Sat Sep 19 01:33:00 UTC 2009
On 09/18/2009 03:37 PM, Mark Mielke wrote:
> On 09/18/2009 02:28 PM, Anand Avati wrote:
>>> For me, it does not clear after 3 mins or 3 hours. I restarted the
>>> machines
>>> at midnight, and the first time I tried again was around 1pm the
>>> next day
>>> (13 hours). I easily recognize the symptoms as the /bin/mount
>>> remains in the
>>> process tree. I can't get a strace -p on the /bin/mount process
>>> since it is
>>> frozen. The glusterfsd process is not frozen - the glusterfs process
>>> seems
>>> to be waiting on /bin/mount to complete. The only way to unfreeze
>>> the mount
>>> seems to be to kill -9 the /bin/mount process (regular kill does not
>>> work), at which point the mount goes into the disconnected state, and
>>> it is recovered with an unmount / remount. I tried to track down the
>>> problem before, but became confused, because glusterfs seems to do its
>>> own FUSE mount management rather than using the standard (for Linux,
>>> anyway?) FUSE user-space libraries. If my memory is correct, the
>>> process is: I run
>>> mount, the mount runs /sbin/mount.glusterfs, which runs glusterfs,
>>> which
>>> runs /bin/mount with the full options?
>> This looks like a different issue from what I previously described. If
>> you are certain that the /bin/mount which was hung was the one which
>> glusterfs had spawned, then the issue might be something else. The way
>> FUSE-based filesystems mount is two-fold. The first 'mount -t glusterfs'
>> starts /bin/mount, which in turn calls /sbin/mount.glusterfs. This
>> starts the glusterfs binary which, while initializing the fuse xlator,
>> calls fuse_mount() in libfuse. libfuse in turn does the second phase of
>> mounting by calling mount -t fuse, and in turn /sbin/mount.fuse. I'm
>> trying to think how the three machines rebooting together could cause
>> the second-phase fuse mount to hang.
>
> Thanks for looking at this. The above is compatible with my thinking.
> I'll see about getting output to prove it.
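As I understand the chain Anand describes, for this volume it would look
roughly like the following (the first-phase command is from memory and the
wrapping is mine, so treat this as a sketch, not a verbatim trace):

  # phase 1: the mount I run (or that runs at boot)
  mount -t glusterfs /etc/glusterfs/tools.vol /gluster/tools
    -> /sbin/mount.glusterfs /etc/glusterfs/tools.vol /gluster/tools
      -> /opt/glusterfs/sbin/glusterfs --log-level=NORMAL \
             --volfile=/etc/glusterfs/tools.vol /gluster/tools

  # phase 2: while initializing the fuse xlator, fuse_mount() in libfuse
  # spawns the second mount - this is the process that stays stuck:
        -> /bin/mount -i -f -t fuse.glusterfs \
               -o rw,allow_other,default_permissions,max_read=131072 \
               /etc/glusterfs/tools.vol /gluster/tools

And here is what it looks like on one of the nodes right now: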
[root@wcarh033]~# ps -ef | grep gluster
root 1548 1 0 21:00 ? 00:00:00
/opt/glusterfs/sbin/glusterfsd -f /etc/glusterfs/glusterfsd.vol
root 1861 1 0 21:00 ? 00:00:00
/opt/glusterfs/sbin/glusterfs --log-level=NORMAL
--volfile=/etc/glusterfs/tools.vol /gluster/tools
root 1874 1861 0 21:00 ? 00:00:00 /bin/mount -i -f -t
fuse.glusterfs -o rw,allow_other,default_permissions,max_read=131072
/etc/glusterfs/tools.vol /gluster/tools
root 2426 2395 0 21:02 pts/2 00:00:00 grep gluster
[root@wcarh033]~# ls /gluster/tools
^C^C
Yep - all three nodes locked up. All it took was a simultaneous reboot
of all three machines.
After I kill -9 1874 (kill 1874 without -9 has no effect) from a
different ssh session, I get:
ls: cannot access /gluster/tools: Transport endpoint is not connected
After this, mount works again (an unmount is not necessary, it turns out).
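So the recovery sequence, roughly (the PID is the /bin/mount from the ps
output above, and the remount invocation is from memory; adjust to
whatever ps shows on each node):

  kill -9 1874          # a plain "kill 1874" has no effect
  ls /gluster/tools     # now fails: "Transport endpoint is not connected"
  mount -t glusterfs /etc/glusterfs/tools.vol /gluster/tools
                        # remount succeeds; no umount needed first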
I am unable to strace -p the hung mount -t fuse process; the strace
itself freezes up. I can pstack it, but that returns 0 lines of output
fairly quickly.
The symptoms are identical on all three machines. The setup is 3-way
replication: each machine runs both a glusterfsd server exposing one
volume and a glusterfs client, with cluster/replicate and a preferred
read of the local server.
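For reference, the client-side replicate section looks roughly like this
(from memory; apart from wcarh033, the hostnames and subvolume names here
are placeholders, not the real ones):

  volume tools-replicate
    type cluster/replicate
    # prefer reading from the brick on the local server
    option read-subvolume wcarh033-tools
    subvolumes wcarh033-tools wcarh034-tools wcarh035-tools
  end-volume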
Cheers,
mark
--
Mark Mielke <mark at mielke.cc>