[Gluster-devel] pre6 hanging problems

August R. Wohlt glusterfs at isidore.net
Thu Jul 26 20:00:49 UTC 2007


Hi avati,

When I run it without gdb, it still shows the same behavior: it runs fine
for a few hours under load and then freezes. When it does, the client spews
the messages below to the logs indefinitely. When I kill glusterfs and
remount the directory, everything is fine again:

2007-07-26 12:21:31 D [fuse-bridge.c:344:fuse_entry_cbk] glusterfs-fuse: ERR => -1 (107)
2007-07-26 12:21:31 D [inode.c:285:__destroy_inode] fuse/inode: destroy inode(0)
2007-07-26 12:23:34 W [client-protocol.c:4158:client_protocol_reconnect] brick: attempting reconnect
2007-07-26 12:23:34 D [tcp-client.c:178:tcp_connect] brick: connection on 4 still in progress - try later
2007-07-26 12:29:51 W [client-protocol.c:4158:client_protocol_reconnect] brick: attempting reconnect
2007-07-26 12:29:51 E [tcp-client.c:170:tcp_connect] brick: non-blocking connect() returned: 110 (Connection timed out)
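
For reference, this is roughly the recovery sequence I go through each time it
freezes. The mountpoint and spec file path are just examples from my setup:

  # kill the hung client process
  killall glusterfs
  # force the unmount of the now-unresponsive mountpoint
  umount -l /mnt/glusterfs
  # remount with the same client spec file
  glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs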

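One other thing I may try: the call_bail log further down reports transport-timeout = 120.
If that option can be set on the protocol/client volume in pre6 (I have not verified that
it can), I would raise it along these lines and see whether the bail-outs go away:

  volume brick
     type protocol/client
     option transport-type tcp/client
     option remote-host 192.168.2.5
     option remote-subvolume brick_1
     option transport-timeout 300    # just an experiment; 120 seems to be the current value
  end-volume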

On 7/26/07, Anand Avati <avati at zresearch.com> wrote:
>
> August,
>  It seems to me that you were running the client in GDB, and for some
> reason that particular client bailed out. While bailing out, the client
> raises SIGCONT, which gdb caught (gdb catches all signals before
> letting the signal handlers take over). The backtrace you attached is
> NOT a crash; you just had to 'c' (continue) at the gdb prompt. Most likely,
> this is also what gave the 'hung' effect.
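>
> (If you do want to keep running it under gdb, here is a small sketch of what I
> mean; it simply tells gdb to pass SIGCONT through instead of stopping on it:
>
>     (gdb) handle SIGCONT nostop noprint pass
>     (gdb) continue
>
> With that, the raise() in tcp_bail no longer leaves the process stopped at the
> gdb prompt; if the mount still appears hung afterwards, hit Ctrl-C and take a
> 'thread apply all bt' to see where it is actually stuck.)
>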
> Is this reproducible for you?
>
> thanks,
> avati
>
> 2007/7/26, August R. Wohlt <glusterfs at isidore.net>:
> >
> > Hi all -
> >
> > I have a client and server set up with the pre6 version of glusterfs.
> > Several times a day the client mount freezes up, as does any command that
> > tries to read from the mountpoint. I have to kill the glusterfs process,
> > unmount the directory, and remount it to get it working again.
> >
> > When this happens, another glusterfs client on a different machine,
> > connected to the same server, does not get disconnected. So the timeout
> > message in the logs is confusing to me. If the connection were really
> > timing out, wouldn't the other client be disconnected too?
> >
> > This is on CentOS 5 with fuse 2.7.0-glfs.
> >
> > When it happens, here's what shows up in the client:
> >
> > ...
> > 2007-07-25 09:45:59 D [inode.c:327:__active_inode] fuse/inode: activating inode(4210807), lru=0/1024
> > 2007-07-25 09:45:59 D [inode.c:285:__destroy_inode] fuse/inode: destroy inode(4210807)
> > 2007-07-25 12:37:26 W [client-protocol.c:211:call_bail] brick: activating bail-out. pending frames = 1. last sent = 2007-07-25 12:33:42. last received = 2007-07-25 11:42:59 transport-timeout = 120
> > 2007-07-25 12:37:26 C [client-protocol.c:219:call_bail] brick: bailing transport
> > 2007-07-25 12:37:26 W [client-protocol.c:4189:client_protocol_cleanup] brick: cleaning up state in transport object 0x80a03d0
> > 2007-07-25 12:37:26 W [client-protocol.c:4238:client_protocol_cleanup] brick: forced unwinding frame type(0) op(15)
> > 2007-07-25 12:37:26 C [tcp.c:81:tcp_disconnect] brick: connection disconnected
> >
> > When it happens, here's what shows up in the server:
> >
> > 2007-07-25 15:37:40 E [protocol.c:346:gf_block_unserialize_transport] libglusterfs/protocol: full_read of block failed: peer ( 192.168.2.3:1023)
> > 2007-07-25 15:37:40 C [tcp.c:81:tcp_disconnect] server: connection disconnected
> > 2007-07-25 15:37:40 E [protocol.c:251:gf_block_unserialize_transport] libglusterfs/protocol: EOF from peer ( 192.168.2.4:1023)
> > 2007-07-25 15:37:40 C [tcp.c:81:tcp_disconnect] server: connection disconnected
> >
> > And here's the client backtrace:
> >
> > (gdb) bt
> > #0  0x0032e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> > #1  0x005a3824 in raise () from /lib/tls/libpthread.so.0
> > #2  0x00655b0c in tcp_bail (this=0x80a03d0) at ../../../../transport/tcp/tcp.c:146
> > #3  0x00695bbc in transport_bail (this=0x80a03d0) at transport.c:192
> > #4  0x00603a16 in call_bail (trans=0x80a03d0) at client-protocol.c:220
> > #5  0x00696870 in gf_timer_proc (ctx=0xbffeec30) at timer.c:119
> > #6  0x0059d3cc in start_thread () from /lib/tls/libpthread.so.0
> > #7  0x00414c3e in clone () from /lib/tls/libc.so.6
> >
> >
> > client config:
> >
> > ### Add client feature and attach to remote subvolume
> > volume brick
> >    type protocol/client
> >    option transport-type tcp/client     # for TCP/IP transport
> >    option remote-host 192.168.2.5       # IP address of the remote brick
> >    option remote-subvolume brick_1  # name of the remote volume
> > end-volume
> >
> > ### Add writeback feature
> > volume brick-wb
> >    type performance/write-behind
> >    option aggregate-size 131072 # unit in bytes
> >    subvolumes brick
> > end-volume
> >
> > server config:
> >
> > ### Export the "brick_1" and "brick_2" directories as volumes.
> >
> > volume brick_1
> >    type storage/posix
> >    option directory /home/vg_3ware1/vivalog/brick_1
> > end-volume
> >
> > volume brick_2
> >    type storage/posix
> >    option directory /home/vg_3ware1/vivalog/brick_2
> > end-volume
> >
> > ### Add network serving capability to the above bricks.
> > volume server
> >    type protocol/server
> >    option transport-type tcp/server     # For TCP/IP transport
> >    option bind-address 192.168.2.5      # Default is to listen on all interfaces
> >    subvolumes brick_1
> >    option auth.ip.brick_2.allow * # Allow access to "brick_2" volume
> >    option auth.ip.brick_1.allow * # Allow access to "brick_1" volume
> > end-volume
> >
> > P.S. I have one server serving two volume bricks to two physically
> > distinct clients. I assume this is okay -- that I don't need two
> > separate server declarations.
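> >
> > To be clear, by "one server declaration" I mean a single protocol/server
> > volume that lists both bricks as subvolumes, roughly as sketched below. I
> > have not confirmed that pre6 accepts more than one subvolume here, so this
> > is only what I have in mind, not what I am currently running:
> >
> > volume server
> >    type protocol/server
> >    option transport-type tcp/server
> >    option bind-address 192.168.2.5
> >    subvolumes brick_1 brick_2
> >    option auth.ip.brick_1.allow *
> >    option auth.ip.brick_2.allow *
> > end-volume
> >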
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
>
>
>
> --
> Anand V. Avati


