[Gluster-devel] pre6 hanging problems

August R. Wohlt glusterfs at isidore.net
Wed Jul 25 20:12:27 UTC 2007


Hi all -

I have a client and server set up with the pre6 version of glusterfs. Several
times a day the client mount freezes, as does any command that tries to read
from the mountpoint. I have to kill the glusterfs process, unmount the
directory, and remount it to get it working again.
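
The recovery dance each time is roughly the following (the spec-file path and
mountpoint below are illustrative; adjust for your setup):

 # kill the stuck client (assumes only one glusterfs process on the box)
 kill $(pidof glusterfs)

 # tear down the dead fuse mount, then remount
 fusermount -u /mnt/glusterfs   # or: umount -l /mnt/glusterfs
 glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs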

When this happens, another glusterfs client on a different machine, connected
to the same server, does not get disconnected. So the timeout message in the
logs is confusing to me: if the server were really timing out, wouldn't the
other client be disconnected, too?

This is on CentOS 5 with fuse 2.7.0-glfs.

When it happens, here's what shows up in the client log:

...
2007-07-25 09:45:59 D [inode.c:327:__active_inode] fuse/inode: activating inode(4210807), lru=0/1024
2007-07-25 09:45:59 D [inode.c:285:__destroy_inode] fuse/inode: destroy inode(4210807)
2007-07-25 12:37:26 W [client-protocol.c:211:call_bail] brick: activating bail-out. pending frames = 1. last sent = 2007-07-25 12:33:42. last received = 2007-07-25 11:42:59 transport-timeout = 120
2007-07-25 12:37:26 C [client-protocol.c:219:call_bail] brick: bailing transport
2007-07-25 12:37:26 W [client-protocol.c:4189:client_protocol_cleanup] brick: cleaning up state in transport object 0x80a03d0
2007-07-25 12:37:26 W [client-protocol.c:4238:client_protocol_cleanup] brick: forced unwinding frame type(0) op(15)
2007-07-25 12:37:26 C [tcp.c:81:tcp_disconnect] brick: connection disconnected
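
The 120 seconds in that bail-out message is the client volume's
transport-timeout, so presumably it can be raised as an experiment; a sketch,
with 300 as an arbitrary value I picked:

 volume brick
   type protocol/client
   option transport-type tcp/client     # for TCP/IP transport
   option remote-host 192.168.2.5       # IP address of the remote brick
   option remote-subvolume brick_1      # name of the remote volume
   option transport-timeout 300         # seconds; the log above shows the default, 120
 end-volume

Of course, if the server really has stopped replying, that only postpones the
bail-out rather than fixing whatever is stalling the reply.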

When it happens, here's what shows up in the server log:

2007-07-25 15:37:40 E [protocol.c:346:gf_block_unserialize_transport] libglusterfs/protocol: full_read of block failed: peer (192.168.2.3:1023)
2007-07-25 15:37:40 C [tcp.c:81:tcp_disconnect] server: connection disconnected
2007-07-25 15:37:40 E [protocol.c:251:gf_block_unserialize_transport] libglusterfs/protocol: EOF from peer (192.168.2.4:1023)
2007-07-25 15:37:40 C [tcp.c:81:tcp_disconnect] server: connection disconnected

And here's the client backtrace:

(gdb) bt
#0  0x0032e7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005a3824 in raise () from /lib/tls/libpthread.so.0
#2  0x00655b0c in tcp_bail (this=0x80a03d0) at ../../../../transport/tcp/tcp.c:146
#3  0x00695bbc in transport_bail (this=0x80a03d0) at transport.c:192
#4  0x00603a16 in call_bail (trans=0x80a03d0) at client-protocol.c:220
#5  0x00696870 in gf_timer_proc (ctx=0xbffeec30) at timer.c:119
#6  0x0059d3cc in start_thread () from /lib/tls/libpthread.so.0
#7  0x00414c3e in clone () from /lib/tls/libc.so.6
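
Note that this is just the timer thread that triggers the bail-out; next time
it hangs I'll capture every thread to see where the fuse thread is actually
blocked, along these lines:

 gdb -p $(pidof glusterfs)      # assumes a single glusterfs process
 (gdb) set pagination off
 (gdb) thread apply all bt
 (gdb) detach
 (gdb) quit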


client config:

 ### Add client feature and attach to remote subvolume
 volume brick
   type protocol/client
   option transport-type tcp/client     # for TCP/IP transport
   option remote-host 192.168.2.5       # IP address of the remote brick
   option remote-subvolume brick_1  # name of the remote volume
 end-volume

 ### Add write-behind feature
 volume brick-wb
   type performance/write-behind
   option aggregate-size 131072 # unit in bytes
   subvolumes brick
 end-volume
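
In case write-behind is implicated, I figure I can rule it out by mounting a
spec that stops at the plain client volume (assuming the fuse mount uses the
topmost volume in the spec file):

 ### debugging spec: write-behind removed, mount "brick" directly
 volume brick
   type protocol/client
   option transport-type tcp/client
   option remote-host 192.168.2.5
   option remote-subvolume brick_1
 end-volume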

server config:

 ### Export volumes "brick_1" and "brick_2" backed by directories on the server.
 volume brick_1
   type storage/posix
   option directory /home/vg_3ware1/vivalog/brick_1
 end-volume

 volume brick_2
   type storage/posix
   option directory /home/vg_3ware1/vivalog/brick_2
 end-volume

 ### Add network serving capability to the above bricks.
 volume server
   type protocol/server
   option transport-type tcp/server     # For TCP/IP transport
   option bind-address 192.168.2.5     # Default is to listen on all interfaces
   subvolumes brick_1 brick_2
   option auth.ip.brick_1.allow * # Allow access to "brick_1" volume
   option auth.ip.brick_2.allow * # Allow access to "brick_2" volume
 end-volume

P.S. I have one server serving two volume bricks to two physically distinct
clients. I assume this is okay and that I don't need two separate server
declarations.
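
For completeness, my understanding is that the second client's spec differs
only in which remote volume it attaches to, roughly:

 volume brick
   type protocol/client
   option transport-type tcp/client
   option remote-host 192.168.2.5
   option remote-subvolume brick_2   # the other physical client attaches to the second brick
 end-volume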


