[Gluster-devel] Weird lock-ups

Mon Oct 27 18:58:28 UTC 2008

Krishna Srinivas wrote:
> On Tue, Oct 21, 2008 at 5:54 PM, Gordan Bobic <gordan at bobich.net> wrote:
>> I'm starting to see lock-ups when using a single-file client/server setup.
>>
>> machine1 (x86): =================================
>> volume home2
>>        type protocol/client
>>        option transport-type tcp/client
>>        option remote-host 192.168.3.1
>>        option remote-subvolume home2
>> end-volume
>>
>> volume home-store
>>        type storage/posix
>>        option directory /gluster/home
>> end-volume
>>
>> volume home1
>>        type features/posix-locks
>>        subvolumes home-store
>> end-volume
>>
>> volume server
>>        type protocol/server
>>        option transport-type tcp/server
>>        subvolumes home1
>>        option auth.ip.home1.allow 127.0.0.1,192.168.*
>> end-volume
>>
>> volume home
>>        type cluster/afr
>>        subvolumes home1 home2
>>        option read-subvolume home1
>> end-volume
>>
>> machine2 (x86-64): =================================
>> volume home1
>>        type protocol/client
>>        option transport-type tcp/client
>>        option remote-host 192.168.0.1
>>        option remote-subvolume home1
>> end-volume
>>
>> volume home-store
>>        type storage/posix
>>        option directory /gluster/home
>> end-volume
>>
>> volume home2
>>        type features/posix-locks
>>        subvolumes home-store
>> end-volume
>>
>> volume server
>>        type protocol/server
>>        option transport-type tcp/server
>>        subvolumes home2
>>        option auth.ip.home2.allow 127.0.0.1,192.168.*
>> end-volume
>>
>> volume home
>>        type cluster/afr
>>        subvolumes home1 home2
>>        option read-subvolume home2
>> end-volume
>>
>> ==================
>>
>> Do those configs look sane?
>>
>> When one machine is running on its own, it's fine. Other client-only
>> machines can connect to it without any problems. However, as soon as the
>> second client/server comes up, typically the first ls access on the
>> directory will lock the whole thing up solid.
>>
>> Interestingly, on the x86 machine, the glusterfs process can always be
>> killed. Not so on the x86-64 machine (the 2nd machine that comes up). kill
>> -9 doesn't kill it. The only way to clear the lock-up is to reboot.
>>
>> Using the 1.3.12 release compiled into an RPM on both machines (CentOS 5.2).
>>
>> One thing worthy of note is that machine2 is nfsrooted / network booted. It
>> has local disks in it, and a local dmraid volume is mounted under /gluster
>> on it (machine1 has a disk-backed root).
>>
>> So, on machine1:
>> / is local disk
>> on machine2:
>> / is NFS
>> /gluster is local disk
>> /gluster/home is exported in the volume spec for AFR.
>>
>> If /gluster/home is newly created, it tends to get a little further, but
>> still locks up pretty quickly. If I try to execute find /home once it is
>> mounted, it will eventually hang, and the only thing of note I could see in
>> the logs is that it said "active lock found" at the point where it
> 
> Do you see this error on server1 or server2? Any other clues in the logs?

Access to the FS locks up on both server1 and server2.

I have split up the setup to separate client and server on server2
(x86-64), and have tried to get it to sync up just the file placeholders
(find . at the root of the glusterfs mounted tree), and this, too, causes
a lock-up. I have managed to kill the glusterfsd process, but only after
killing the glusterfs process first.

This ends up in the logs on server2, in the glusterfs (client) log:
2008-10-27 18:44:31 C [client-protocol.c:212:call_bail] home2: bailing 
transport
2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] 
home2: forced unwinding frame type(1) op(36) reply=@0x612b70
2008-10-27 18:44:31 E [client-protocol.c:4215:client_setdents_cbk] 
home2: no proper reply from server, returning ENOTCONN
2008-10-27 18:44:31 E [afr_self_heal.c:155:afr_lds_setdents_cbk] mirror: 
op_ret=-1 op_errno=107
2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] 
home2: forced unwinding frame type(1) op(34) reply=@0x612b70
2008-10-27 18:44:31 E [client-protocol.c:4430:client_lookup_cbk] home2: 
no proper reply from server, returning ENOTCONN
2008-10-27 18:44:31 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 
19915: (34) /gordan/bin => -1 (5)
2008-10-27 18:45:51 C [client-protocol.c:212:call_bail] home2: bailing 
transport
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] 
home2: forced unwinding frame type(1) op(0) reply=@0x9ae090
2008-10-27 18:45:51 E [client-protocol.c:2688:client_stat_cbk] home2: no 
proper reply from server, returning ENOTCONN
2008-10-27 18:45:51 E [afr.c:3298:afr_stat_cbk] mirror: (child=home2) 
op_ret=-1 op_errno=107
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] 
home2: forced unwinding frame type(1) op(34) reply=@0x9ae090
2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: 
no proper reply from server, returning ENOTCONN
2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] 
home2: transport_submit failed
2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] 
home2: transport_submit failed
2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] 
home2: forced unwinding frame type(1) op(34) reply=@0x9ae090
2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: 
no proper reply from server, returning ENOTCONN
2008-10-27 18:46:23 E [protocol.c:271:gf_block_unserialize_transport] 
home1: EOF from peer (192.168.0.1:6996)
2008-10-27 18:46:23 E [client-protocol.c:4834:client_protocol_cleanup] 
home1: forced unwinding frame type(2) op(5) reply=@0x2aaaac001230
2008-10-27 18:46:23 E [client-protocol.c:4246:client_lock_cbk] home1: no 
proper reply from server, returning ENOTCONN

I think this was generated in the logs only after the server2 client was
forcefully killed, not when the lock-up occurred, though.
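
For anyone reading the log above, the numeric codes can be mapped back to
names with a few lines of Python. This is only a rough sketch: the table
assumes Linux errno numbering (107 = ENOTCONN, 5 = EIO), and the function
name and regular expressions are my own, not anything from the GlusterFS
source.

```python
import re

# Assumption: Linux errno numbering. Only the two codes that appear in the
# log excerpt are listed; anything else is reported as a bare number.
LINUX_ERRNO = {5: "EIO", 107: "ENOTCONN"}

# Hypothetical patterns matching the two places a numeric errno shows up
# in the excerpt: AFR callbacks and fuse-bridge entries.
ERRNO_PATTERNS = (
    re.compile(r"op_errno=(\d+)"),
    re.compile(r"=> -1 \((\d+)\)"),
)


def decode_errnos(line):
    """Return the symbolic names of any errno values found in one log line."""
    names = []
    for pattern in ERRNO_PATTERNS:
        for match in pattern.finditer(line):
            code = int(match.group(1))
            names.append(LINUX_ERRNO.get(code, "errno %d" % code))
    return names
```

So op_errno=107 in the afr lines is the same ENOTCONN that the
client-protocol lines report, and the (5) on the fuse_entry_cbk line is EIO
as seen by the application.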

If I merge the client and server config into a single volume definition
on server2, the lock-up happens as soon as the FS is mounted. If
server2-server gets brought up first, then server1-combined, then
server2-client, it seems to last a bit longer.

I'm wondering now if it fails on a particular file/file type (e.g. a 
socket).
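
One way to check the file-type theory would be to list everything under
the backend directory that is not a regular file or directory, then see
whether the hang lines up with one of those entries. A minimal sketch,
assuming it is run directly against the backend export (/gluster/home)
rather than the glusterfs mount; the function name is made up for
illustration.

```python
import os
import stat


def find_special_files(root):
    """Return (path, kind) for every socket, FIFO, device node or symlink under root."""
    kinds = (
        (stat.S_ISSOCK, "socket"),
        (stat.S_ISFIFO, "fifo"),
        (stat.S_ISCHR, "char device"),
        (stat.S_ISBLK, "block device"),
        (stat.S_ISLNK, "symlink"),
    )
    found = []
    # os.walk does not follow symlinked directories by default, so the walk
    # stays inside root.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat, not stat, so a symlink is reported as itself.
            mode = os.lstat(path).st_mode
            for check, label in kinds:
                if check(mode):
                    found.append((path, label))
                    break
    return found
```

Calling find_special_files("/gluster/home") on each server and comparing
the output with the last path in the client log before a hang would show
whether a socket or similar is involved.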

But whatever is causing it, it is completely reproducible. I haven't 
been able to keep it running under these circumstances for long enough 
to finish loading X with the home directory mounted over glusterfs with 
both servers running.

Gordan