[Gluster-devel] Weird lock-ups

Mon Oct 27 20:24:53 UTC 2008

On Mon, 27 Oct 2008, gordan at bobich.net wrote:

>
>
> On Mon, 27 Oct 2008, gordan at bobich.net wrote:
>
>> 
>>
>>  On Mon, 27 Oct 2008, Gordan Bobic wrote:
>> 
>> >   Krishna Srinivas wrote:
>> > >    On Tue, Oct 21, 2008 at 5:54 PM, Gordan Bobic <gordan at bobich.net> 
>> > >    wrote:
>> > > >    I'm starting to see lock-ups when using a single-file 
>> > > >    client/server setup.
>> > > > 
>> > > >    machine1 (x86): =================================
>> > > >    volume home2
>> > > >           type protocol/client
>> > > >           option transport-type tcp/client
>> > > >           option remote-host 192.168.3.1
>> > > >           option remote-subvolume home2
>> > > >    end-volume
>> > > > 
>> > > >    volume home-store
>> > > >           type storage/posix
>> > > >           option directory /gluster/home
>> > > >    end-volume
>> > > > 
>> > > >    volume home1
>> > > >           type features/posix-locks
>> > > >           subvolumes home-store
>> > > >    end-volume
>> > > > 
>> > > >    volume server
>> > > >           type protocol/server
>> > > >           option transport-type tcp/server
>> > > >           subvolumes home1
>> > > >           option auth.ip.home1.allow 127.0.0.1,192.168.*
>> > > >    end-volume
>> > > > 
>> > > >    volume home
>> > > >           type cluster/afr
>> > > >           subvolumes home1 home2
>> > > >           option read-subvolume home1
>> > > >    end-volume
>> > > > 
>> > > >    machine2 (x86-64): =================================
>> > > >    volume home1
>> > > >           type protocol/client
>> > > >           option transport-type tcp/client
>> > > >           option remote-host 192.168.0.1
>> > > >           option remote-subvolume home1
>> > > >    end-volume
>> > > > 
>> > > >    volume home-store
>> > > >           type storage/posix
>> > > >           option directory /gluster/home
>> > > >    end-volume
>> > > > 
>> > > >    volume home2
>> > > >           type features/posix-locks
>> > > >           subvolumes home-store
>> > > >    end-volume
>> > > > 
>> > > >    volume server
>> > > >           type protocol/server
>> > > >           option transport-type tcp/server
>> > > >           subvolumes home2
>> > > >           option auth.ip.home2.allow 127.0.0.1,192.168.*
>> > > >    end-volume
>> > > > 
>> > > >    volume home
>> > > >           type cluster/afr
>> > > >           subvolumes home1 home2
>> > > >           option read-subvolume home2
>> > > >    end-volume
>> > > > 
>> > > >    ==================
>> > > > 
>> > > >    Do those configs look sane?
>> > > > 
>> > > >    When one machine is running on it's own, it's fine. Other 
>> > > >    client-only
>> > > >    machines can connect to it without any problems. However, as soon 
>> > > >    as the
>> > > >    second client/server comes up, typically the first ls access on 
>> > > >    the
>> > > >    directory will lock the whole thing up solid.
>> > > > 
>> > > >    Interestingly, on the x86 machine, the glusterfs process can 
>> > > >    always be
>> > > >    killed. Not so on the x86-64 machine (the 2nd machine that comes 
>> > > >    up). kill
>> > > >    -9 doesn't kill it. The only way to clear the lock-up is to 
>> > > >    reboot.
>> > > > 
>> > > >    Using the 1.3.12 release compiled into an RPM on both machines 
>> > > >    (CentOS 5.2).
>> > > > 
>> > > >    One thing worthy of note is that machine2 is nfsrooted / network 
>> > > >    booted. It
>> > > >    has local disks in it, and a local dmraid volume is mounted under 
>> > > >    /gluster
>> > > >    on it (machine1 has a disk-backed root).
>> > > > 
>> > > >    So, on machine1:
>> > > >    / is local disk
>> > > >    on machine2:
>> > > >    / is NFS
>> > > >    /gluster is local disk
>> > > >    /gluster/home is exported in the volume spec for AFR.
>> > > > 
>> > > >    If /gluster/home is newly created, it tends to get a little 
>> > > >    further, but
>> > > >    still locks up pretty quickly. If I try to execute find /home once 
>> > > >    it is
>> > > >    mounted, it will eventually hang, and the only thing of note I 
>> > > >    could see in
>> > > >    the logs is that it said "active lock found" at the point where it
>> > > 
>> > >    Do you see this error on server1 or server2? Any other clues in the 
>> > >    logs?
>> > 
>> >   Access to the FS locks up on both server1 and server2.
>> > 
>> >   I have split up the setup to separate cliend and server on server2
>> >   (x86-64), and have tried to get it to sync up just the file 
>> >   placeholders
>> >   (find . at the root of the glusterfs mounted tree), and this, too 
>> >   causes a
>> >   lock-up. I have managed to kill the glusterfsd process, but only after
>> >   killing the glusterfs process first.
>> > 
>> >   This ends up in the logs on server2, in the glusterfs (client) log:
>> >   2008-10-27 18:44:31 C [client-protocol.c:212:call_bail] home2: bailing
>> >   transport
>> >   2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup]
>> >   home2: forced unwinding frame type(1) op(36) reply=@0x612b70
>> >   2008-10-27 18:44:31 E [client-protocol.c:4215:client_setdents_cbk] 
>> >   home2:
>> >   no proper reply from server, returning ENOTCONN
>> >   2008-10-27 18:44:31 E [afr_self_heal.c:155:afr_lds_setdents_cbk] 
>> >   mirror:
>> >   op_ret=-1 op_errno=107
>> >   2008-10-27 18:44:31 E [client-protocol.c:4834:client_protocol_cleanup]
>> >   home2: forced unwinding frame type(1) op(34) reply=@0x612b70
>> >   2008-10-27 18:44:31 E [client-protocol.c:4430:client_lookup_cbk] home2: 
>> >   no
>> >   proper reply from server, returning ENOTCONN
>> >   2008-10-27 18:44:31 E [fuse-bridge.c:468:fuse_entry_cbk] 
>> >   glusterfs-fuse:
>> >   19915: (34) /gordan/bin => -1 (5)
>> >   2008-10-27 18:45:51 C [client-protocol.c:212:call_bail] home2: bailing
>> >   transport
>> >   2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup]
>> >   home2: forced unwinding frame type(1) op(0) reply=@0x9ae090
>> >   2008-10-27 18:45:51 E [client-protocol.c:2688:client_stat_cbk] home2: 
>> >   no
>> >   proper reply from server, returning ENOTCONN
>> >   2008-10-27 18:45:51 E [afr.c:3298:afr_stat_cbk] mirror: (child=home2)
>> >   op_ret=-1 op_errno=107
>> >   2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup]
>> >   home2: forced unwinding frame type(1) op(34) reply=@0x9ae090
>> >   2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: 
>> >   no
>> >   proper reply from server, returning ENOTCONN
>> >   2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] 
>> >   home2:
>> >   transport_submit failed
>> >   2008-10-27 18:45:51 E [client-protocol.c:325:client_protocol_xfer] 
>> >   home2:
>> >   transport_submit failed
>> >   2008-10-27 18:45:51 E [client-protocol.c:4834:client_protocol_cleanup]
>> >   home2: forced unwinding frame type(1) op(34) reply=@0x9ae090
>> >   2008-10-27 18:45:51 E [client-protocol.c:4430:client_lookup_cbk] home2: 
>> >   no
>> >   proper reply from server, returning ENOTCONN
>> >   2008-10-27 18:46:23 E [protocol.c:271:gf_block_unserialize_transport]
>> >   home1: EOF from peer (192.168.0.1:6996)
>> >   2008-10-27 18:46:23 E [client-protocol.c:4834:client_protocol_cleanup]
>> >   home1: forced unwinding frame type(2) op(5) reply=@0x2aaaac00
>> >   1230
>> >   2008-10-27 18:46:23 E [client-protocol.c:4246:client_lock_cbk] home1: 
>> >   no
>> >   proper reply from server, returning ENOTCONN
>> > 
>> >   I think this was generated in the logs only after the server2 client 
>> >   was
>> >   forcefully killed, not when the lock-up occured, though.
>> > 
>> >   If I merge the client and server config into a single volume definition 
>> >   on
>> >   server2, the lock-up happens as soon as the FS is mounted. If
>> >   server2-server gets brought up first, the server1-combined, then
>> >   server2-client, it seems to last a bit longer.
>> > 
>> >   I'm wondering now if it fails on a particular file/file type (e.g. a
>> >   socket).
>> > 
>> >   But whatever is causing it, it is completely reproducible. I haven't 
>> >   been
>> >   able to keep it running under these circumstances for long enough to
>> >   finish loading X with the home directory mounted over glusterfs with 
>> >   both
>> >   servers running.
>>
>>  Update - the problem seems to be somehow linked to running the client on
>>  server2. If I start up server2-server, and server1 client+server, I can
>>  execute a complete "find ." on the gluster mounted volume (from server1,
>>  obviously, server2 doesn't have a client running), and instigate a full
>>  resync by the usual "find /home -type f -exec head -c1 {} \; > /dev/null".
>>  This all works, and all files end up on server2.
>>
>>  But doing this with client up and running on server2 makes the whole
>>  process lock up. Sometimes I can only get the first "ls -la" on the base
>>  of the mounted tree before everything subsequent locks up and ends up
>>  waiting until I "killall glusterfs" on server2. At this point glusterfsd
>>  (server) on server2 is unkillable until glusterfs (client) is killed
>>  first.
>>
>>  I have just completed a full rescan of the underlying file system on
>>  server1 just in case that might have gone wrong, and it passed without any
>>  issues.
>>
>>  So, something in the server2 (x86-64) client part causes a lock-up
>>  somewhere in the process. :-(
>
> I just noticed this sort of things ends up occassionally logged on server1, 
> but not for all files:
>
> 2008-10-27 19:51:13 E [afr.c:1164:afr_selfheal_unlock_cbk] home: 
> (path=/path/file child=home1) op_ret=-1 op_errno=2
> 2008-10-27 19:51:13 E [afr.c:2203:afr_open] home: self heal failed, returning 
> EIO
> 2008-10-27 19:51:13 E [fuse-bridge.c:715:fuse_fd_cbk] glusterfs-fuse: 92274: 
> (12) /path/file => -1 (5)
>
> However, cat-ing the file (again, on server1), either through glusterfs or 
> the underlying file system works fine and returns no error.

I just did a bit more digging, and it seems that the I/O error on read 
occurs only when reading the FS through glusterfs/fuse. It seems to occur 
randomly, and re-trying seems to work OK. This is still in the config 
where the 2nd server isn't acting as a client.

Another interesting thing I have observed - while the servers are syncing, 
cat-ing a file that is on server1 but not yet synced to server2 from a 3rd 
machine that has client-side AFR enabled for both server1 and server2, 
will sometimes return an empty file. After a few times, the file gets 
synced, and then the correct contents get returned. I thought the clocks 
might be out of sync or something else like that, but resyncing them all 
showed they were at most out of sync by a few milliseconds.

All machines in question are CentOS (i.e. RHEL) 5.2, with gluster packages 
built as rpms from the tar ball, version 1.3.12. All machines except 
server2 are x86 (server2 is x86-64).

Is there any way I can provide more meaningful and less anecdotal debug 
information? The problem I am seeing is very real and reproducible, but 
not a great deal (of useful things, at least) ends up in the logs.

Gordan