[Gluster-devel] Weird lock-ups

Mon Oct 27 19:55:03 UTC 2008

On Mon, 27 Oct 2008, Gordan Bobic wrote:

> Krishna Srinivas wrote:
>>  On Tue, Oct 21, 2008 at 5:54 PM, Gordan Bobic <gordan at bobich.net> wrote:
>> >  I'm starting to see lock-ups when using a single-file client/server 
>> >  setup.
>> > 
>> >  machine1 (x86): =================================
>> >  volume home2
>> >         type protocol/client
>> >         option transport-type tcp/client
>> >         option remote-host 192.168.3.1
>> >         option remote-subvolume home2
>> >  end-volume
>> > 
>> >  volume home-store
>> >         type storage/posix
>> >         option directory /gluster/home
>> >  end-volume
>> > 
>> >  volume home1
>> >         type features/posix-locks
>> >         subvolumes home-store
>> >  end-volume
>> > 
>> >  volume server
>> >         type protocol/server
>> >         option transport-type tcp/server
>> >         subvolumes home1
>> >         option auth.ip.home1.allow 127.0.0.1,192.168.*
>> >  end-volume
>> > 
>> >  volume home
>> >         type cluster/afr
>> >         subvolumes home1 home2
>> >         option read-subvolume home1
>> >  end-volume
>> > 
>> >  machine2 (x86-64): =================================
>> >  volume home1
>> >         type protocol/client
>> >         option transport-type tcp/client
>> >         option remote-host 192.168.0.1
>> >         option remote-subvolume home1
>> >  end-volume
>> > 
>> >  volume home-store
>> >         type storage/posix
>> >         option directory /gluster/home
>> >  end-volume
>> > 
>> >  volume home2
>> >         type features/posix-locks
>> >         subvolumes home-store
>> >  end-volume
>> > 
>> >  volume server
>> >         type protocol/server
>> >         option transport-type tcp/server
>> >         subvolumes home2
>> >         option auth.ip.home2.allow 127.0.0.1,192.168.*
>> >  end-volume
>> > 
>> >  volume home
>> >         type cluster/afr
>> >         subvolumes home1 home2
>> >         option read-subvolume home2
>> >  end-volume
>> > 
>> >  ==================
>> > 
>> >  Do those configs look sane?
>> > 
>> >  When one machine is running on it's own, it's fine. Other client-only
>> >  machines can connect to it without any problems. However, as soon as the
>> >  second client/server comes up, typically the first ls access on the
>> >  directory will lock the whole thing up solid.
>> > 
>> >  Interestingly, on the x86 machine, the glusterfs process can always be
>> >  killed. Not so on the x86-64 machine (the 2nd machine that comes up). 
>> >  kill
>> >  -9 doesn't kill it. The only way to clear the lock-up is to reboot.
>> > 
>> >  Using the 1.3.12 release compiled into an RPM on both machines (CentOS 
>> >  5.2).
>> > 
>> >  One thing worthy of note is that machine2 is nfsrooted / network booted. 
>> >  It
>> >  has local disks in it, and a local dmraid volume is mounted under 
>> >  /gluster
>> >  on it (machine1 has a disk-backed root).
>> > 
>> >  So, on machine1:
>> >  / is local disk
>> >  on machine2:
>> >  / is NFS
>> >  /gluster is local disk
>> >  /gluster/home is exported in the volume spec for AFR.
>> > 
>> >  If /gluster/home is newly created, it tends to get a little further, but
>> >  still locks up pretty quickly. If I try to execute find /home once it is
>> >  mounted, it will eventually hang, and the only thing of note I could see 
>> >  in
>> >  the logs is that it said "active lock found" at the point where it
>>
>>  Do you see this error on server1 or server2? Any other clues in the logs?
>
> Access to the FS locks up on both server1 and server2.
>
> I have split up the setup to separate cliend and server on server2 (x86-64), 
> and have tried to get it to sync up just the file placeholders (find . at the 
> root of the glusterfs mounted tree), and this, too causes a lock-up. I have 
> managed to kill the glusterfsd process, but only after killing the glusterfs 
> process first.
>
> This ends up in the logs on server2, in the glusterfs (client) log:
> 2008-10-27 18:44:31 C [client-protocol.c:212:call_bail] home2: bailing 
> transport
> 2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] home2: 
> forced unwinding frame type(1) op(36) reply=@0x612b70
> 2008-10-27 18:44:31 E [client-protocol.c:4215:client_setdents_cbk] home2: no 
> proper reply from server, returning ENOTCONN
> 2008-10-27 18:44:31 E [afr_self_heal.c:155:afr_lds_setdents_cbk] mirror: 
> op_ret=-1 op_errno=107
> 2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup] home2: 
> forced unwinding frame type(1) op(34) reply=@0x612b70
> 2008-10-27 18:44:31 E [client-protocol.c:4430:client_lookup_cbk] home2: no 
> proper reply from server, returning ENOTCONN
> 2008-10-27 18:44:31 E [fuse-bridge.c:468:fuse_entry_cbk] glusterfs-fuse: 
> 19915: (34) /gordan/bin => -1 (5)
> 2008-10-27 18:45:51 C [client-protocol.c:212:call_bail] home2: bailing 
> transport
> 2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: 
> forced unwinding frame type(1) op(0) reply=@0x9ae090
> 2008-10-27 18:45:51 E [client-protocol.c:2688:client_stat_cbk] home2: no 
> proper reply from server, returning ENOTCONN
> 2008-10-27 18:45:51 E [afr.c:3298:afr_stat_cbk] mirror: (child=home2) 
> op_ret=-1 op_errno=107
> 2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: 
> forced unwinding frame type(1) op(34) reply=@0x9ae090
> 2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no 
> proper reply from server, returning ENOTCONN
> 2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2: 
> transport_submit failed
> 2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] home2: 
> transport_submit failed
> 2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup] home2: 
> forced unwinding frame type(1) op(34) reply=@0x9ae090
> 2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: no 
> proper reply from server, returning ENOTCONN
> 2008-10-27 18:46:23 E [protocol.c:271:gf_block_unserialize_transport] home1: 
> EOF from peer (192.168.0.1:6996)
> 2008-10-27 18:46:23 E [client-protocol.c:4834:client_protocol_cleanup] home1: 
> forced unwinding frame type(2) op(5) reply=@0x2aaaac00
> 1230
> 2008-10-27 18:46:23 E [client-protocol.c:4246:client_lock_cbk] home1: no 
> proper reply from server, returning ENOTCONN
>
> I think this was generated in the logs only after the server2 client was 
> forcefully killed, not when the lock-up occured, though.
>
> If I merge the client and server config into a single volume definition on 
> server2, the lock-up happens as soon as the FS is mounted. If server2-server 
> gets brought up first, the server1-combined, then server2-client, it seems to 
> last a bit longer.
>
> I'm wondering now if it fails on a particular file/file type (e.g. a socket).
>
> But whatever is causing it, it is completely reproducible. I haven't been 
> able to keep it running under these circumstances for long enough to finish 
> loading X with the home directory mounted over glusterfs with both servers 
> running.

Update - the problem seems to be somehow linked to running the client on 
server2. If I start up server2-server, and server1 client+server, I can 
execute a complete "find ." on the gluster mounted volume (from server1, 
obviously, server2 doesn't have a client running), and instigate a 
full resync by the usual "find /home -type f -exec head -c1 {} \; > 
/dev/null". This all works, and all files end up on server2.

But doing this with client up and running on server2 makes the whole 
process lock up. Sometimes I can only get the first "ls -la" on the base 
of the mounted tree before everything subsequent locks up and ends up 
waiting until I "killall glusterfs" on server2. At this point glusterfsd 
(server) on server2 is unkillable until glusterfs (client) is killed 
first.

I have just completed a full rescan of the underlying file system on 
server1 just in case that might have gone wrong, and it passed without any 
issues.

So, something in the server2 (x86-64) client part causes a lock-up 
somewhere in the process. :-(

Gordan