[Gluster-devel] Weird lock-ups

Gordan Bobic gordan at bobich.net
Tue Oct 21 12:24:19 UTC 2008


I'm starting to see lock-ups when using a single-file client/server setup.

machine1 (x86): =================================
volume home2
         type protocol/client
         option transport-type tcp/client
         option remote-host 192.168.3.1
         option remote-subvolume home2
end-volume

volume home-store
         type storage/posix
         option directory /gluster/home
end-volume

volume home1
         type features/posix-locks
         subvolumes home-store
end-volume

volume server
         type protocol/server
         option transport-type tcp/server
         subvolumes home1
         option auth.ip.home1.allow 127.0.0.1,192.168.*
end-volume

volume home
         type cluster/afr
         subvolumes home1 home2
         option read-subvolume home1
end-volume

machine2 (x86-64): =================================
volume home1
         type protocol/client
         option transport-type tcp/client
         option remote-host 192.168.0.1
         option remote-subvolume home1
end-volume

volume home-store
         type storage/posix
         option directory /gluster/home
end-volume

volume home2
         type features/posix-locks
         subvolumes home-store
end-volume

volume server
         type protocol/server
         option transport-type tcp/server
         subvolumes home2
         option auth.ip.home2.allow 127.0.0.1,192.168.*
end-volume

volume home
         type cluster/afr
         subvolumes home1 home2
         option read-subvolume home2
end-volume

==================

Do those configs look sane?

When one machine is running on its own, it's fine. Other client-only 
machines can connect to it without any problems. However, as soon as the 
second client/server comes up, typically the first ls access on the 
directory will lock the whole thing up solid.

Interestingly, on the x86 machine, the glusterfs process can always be 
killed. Not so on the x86-64 machine (the 2nd machine that comes up). 
kill -9 doesn't kill it. The only way to clear the lock-up is to reboot.
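My guess is that the stuck process is sitting in uninterruptible sleep in 
the kernel, which would explain why kill -9 has no effect. Something like 
this (nothing gluster-specific, just procps) shows the process state and 
the kernel wait channel it is blocked in:

```shell
# List glusterfs processes with their state (STAT) and kernel wait
# channel (WCHAN). A "D" in STAT means uninterruptible sleep, which
# SIGKILL cannot interrupt until the kernel call returns.
ps -eo pid,stat,wchan:20,cmd | awk 'NR==1 || /[g]lusterfs/'
```

On the x86-64 box I would expect to see D there while it is hung; 
presumably S on the x86 box, which would be why that one can be killed.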

Using the 1.3.12 release compiled into an RPM on both machines (CentOS 5.2).

One thing worthy of note is that machine2 is nfsrooted / network booted. 
It has local disks in it, and a local dmraid volume is mounted under 
/gluster on it (machine1 has a disk-backed root).

So, on machine1:
/ is local disk
on machine2:
/ is NFS
/gluster is local disk
/gluster/home is exported in the volume spec for AFR.

If /gluster/home is newly created, it tends to get a little further, but 
still locks up pretty quickly. If I try to execute find /home once it is 
mounted, it will eventually hang, and the only thing of note I could see 
in the logs is that it said "active lock found" at the point where it 
locked up. Once it locks up, both sides lose access to the FS; machine2 
needs a reboot, while machine1 can get away with just killing the 
glusterfs process.
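In case it helps anyone reproduce this, the lock message can be pulled 
out of the logs like so (the log directory here is an assumption on my 
part; substitute whatever your --logfile option points at):

```shell
# Grep the glusterfs logs for the "active lock" message seen at the
# point of the hang. /var/log/glusterfs is an assumed default; set
# LOGDIR to match your --logfile location.
LOGDIR=${LOGDIR:-/var/log/glusterfs}
grep -n "active lock" "$LOGDIR"/*.log 2>/dev/null \
    || echo "no matching log entries under $LOGDIR"
```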

I tried splitting the config up into a separate client and server on 
machine2 (machine1 still using a single file), but the problem persists.
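To give an idea of what that split amounts to (a sketch, not a verbatim 
copy of the spec files): the server spec keeps the storage/posix, 
features/posix-locks and protocol/server volumes from above, and the 
client spec talks to both servers over TCP. The loopback connection to 
the local server is the only structural difference from the single-file 
version:

```
volume home1
         type protocol/client
         option transport-type tcp/client
         option remote-host 192.168.0.1
         option remote-subvolume home1
end-volume

volume home2
         type protocol/client
         option transport-type tcp/client
         option remote-host 127.0.0.1
         option remote-subvolume home2
end-volume

volume home
         type cluster/afr
         subvolumes home1 home2
         option read-subvolume home2
end-volume
```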

Am I doing something wrong, or have I run into a bug?

Thanks.

Gordan
