[Gluster-devel] glusterfs client problem

Shawn Northart shawn at jupiterhosting.com
Mon Apr 2 22:46:55 UTC 2007


I'm noticing a problem with our test setup under reasonably heavy read/write usage.
The problem we're having is that during an rsync of content, the sync bails because the mount is lost, with the following errors:

<snip>
rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trailers" failed:
Transport endpoint is not connected (107)
rsync: recv_generator: mkdir
"/vol/vol0/sites/TESTSITE.com/htdocs/trialmember" failed: Transport
endpoint is not connected (107)
rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember" failed:
Transport endpoint is not connected (107)
rsync: recv_generator: mkdir
"/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/bardoux" failed:
Transport endpoint is not connected (107)
rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/bardoux"
failed: Transport endpoint is not connected (107)
rsync: recv_generator: mkdir
"/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/images" failed:
Transport endpoint is not connected (107)
rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/images"
failed: Transport endpoint is not connected (107)
rsync: recv_generator: mkdir
"/vol/vol0/sites/TESTSITE.com/htdocs/upgrade_trailers" failed: Transport
endpoint is not connected (107)
rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/upgrade_trailers"
failed: Transport endpoint is not connected (107)
</snip>
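
For context, the workload that triggers this is just a one-way rsync of site content onto the glusterfs mount. It is roughly the following, with illustrative paths (the /vol/vol0 mount point is inferred from the error output above, and the source path is made up for the example):

##-- sketch of the rsync run (not the exact command)
# archive-mode copy of one site's content onto the glusterfs mount
rsync -av /home/sites/TESTSITE.com/ /vol/vol0/sites/TESTSITE.com/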

Normal logging shows nothing on either the client or the server side, but with logging set to DEBUG, the following appears at the end of the client log right as it breaks (a signal 11, i.e. the client process segfaults):

<snip>
[Apr 02 13:25:11] [DEBUG/common-utils.c:213/gf_print_trace()]
debug-backtrace:Got signal (11), printing backtrace
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/usr/local/glusterfs-mainline/lib/libglusterfs.so.0(gf_print_trace+0x1f) [0x2a9556030f]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/lib64/tls/libc.so.6 [0x35b992e2b0]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/lib64/tls/libpthread.so.0(__pthread_mutex_destroy+0)
[0x35ba807ab0]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/cluster/afr.so [0x2a958b840c]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/protocol/client.so [0x2a957b06c2]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/protocol/client.so [0x2a957b3196]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/usr/local/glusterfs-mainline/lib/libglusterfs.so.0(epoll_iteration+0xf8) [0x2a955616f8]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:[glusterfs] [0x4031b7]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:/lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x35b991c3fb]
[Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
debug-backtrace:[glusterfs] [0x402bba]
</snip>
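
The raw addresses in that trace aren't very readable on their own. If it would help, we can enable core dumps and pull a symbolic backtrace out of gdb after the next crash, along these lines (the sbin path under our /usr/local/glusterfs-mainline prefix is an assumption; adjust to wherever the binary actually lives):

##-- getting a symbolic trace from the crash (sketch)
# let the client process dump core, then re-run the rsync until it dies
ulimit -c unlimited
# afterwards, load the core into gdb and print the backtrace
gdb /usr/local/glusterfs-mainline/sbin/glusterfs /path/to/core
(gdb) bt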


The server log shows the following at the time it breaks (the failed header reads are the server seeing the client's connections go away):
<snip>
[Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
libglusterfs:full_rw: 0 bytes r/w instead of 113
[Apr 02 15:30:09]
[DEBUG/protocol.c:244/gf_block_unserialize_transport()]
libglusterfs/protocol:gf_block_unserialize_transport: full_read of
header failed
[Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
protocol/server:cleaned up xl_private of 0x510470
[Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
tcp/server:destroying transport object for 192.168.0.96:1012 (fd=8)
[Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
libglusterfs:full_rw: 0 bytes r/w instead of 113
[Apr 02 15:30:09]
[DEBUG/protocol.c:244/gf_block_unserialize_transport()]
libglusterfs/protocol:gf_block_unserialize_transport: full_read of
header failed
[Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
protocol/server:cleaned up xl_private of 0x510160
[Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
tcp/server:destroying transport object for 192.168.0.96:1013 (fd=7)
[Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
libglusterfs:full_rw: 0 bytes r/w instead of 113
[Apr 02 15:30:09]
[DEBUG/protocol.c:244/gf_block_unserialize_transport()]
libglusterfs/protocol:gf_block_unserialize_transport: full_read of
header failed
[Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
protocol/server:cleaned up xl_private of 0x502300
[Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
tcp/server:destroying transport object for 192.168.0.96:1014 (fd=4)
</snip>

We're using 4 bricks in this setup and, for the moment, just one client (we'd like to scale to 20-30 clients and 4-8 server bricks).
The same behavior is observed with or without any combination of the performance translators, and with or without file replication. The alu, random, and round-robin schedulers were all tried in our testing.
The systems in question run CentOS 4.4. These logs are from our 64-bit systems, but we have seen exactly the same thing on the 32-bit ones as well.
GlusterFS looks like it could be a good fit for some of the high-traffic domains we host, but unless we can resolve this issue we'll have to continue using NFS.


Our current server-side (brick) config is as follows:
##-- begin server config
volume vol1 
  type storage/posix
  option directory /vol/vol1/gfs
end-volume

volume vol2
  type storage/posix
  option directory /vol/vol2/gfs
end-volume

volume vol3
  type storage/posix
  option directory /vol/vol3/gfs
end-volume

volume brick1
  type performance/io-threads
  option thread-count 8
  subvolumes vol1
end-volume

volume brick2
  type performance/io-threads
  option thread-count 8
  subvolumes vol2
end-volume

volume brick3
  type performance/io-threads
  option thread-count 8
  subvolumes vol3
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option bind-address 10.88.188.91
  subvolumes brick1 brick2 brick3
  option auth.ip.brick1.allow 192.168.0.*
  option auth.ip.brick2.allow 192.168.0.*
  option auth.ip.brick3.allow 192.168.0.*
end-volume
##-- end server config
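
For reference, the servers are started directly against this spec file; the DEBUG output quoted above comes from raising the log level on the command line, roughly like this (file locations are ours, and the flag spellings are what our 1.3.0-pre build accepts, so treat them as assumptions):

##-- how glusterfsd is launched (sketch)
glusterfsd -f /etc/glusterfs/server.vol \
           -l /var/log/glusterfs/glusterfsd.log \
           -L DEBUG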


Our client config is as follows:

##-- begin client config
volume test00.1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.91
  option remote-subvolume brick1
end-volume
volume test00.2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.91
  option remote-subvolume brick2
end-volume
volume test00.3
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.91
  option remote-subvolume brick3
end-volume


volume test01.1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.92
  option remote-subvolume brick1
end-volume
volume test01.2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.92
  option remote-subvolume brick2
end-volume
volume test01.3
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.92
  option remote-subvolume brick3
end-volume


volume test02.1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.93
  option remote-subvolume brick1
end-volume
volume test02.2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.93
  option remote-subvolume brick2
end-volume
volume test02.3
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.93
  option remote-subvolume brick3
end-volume


volume test03.1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.94
  option remote-subvolume brick1
end-volume
volume test03.2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.94
  option remote-subvolume brick2
end-volume
volume test03.3
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.0.94
  option remote-subvolume brick3
end-volume



volume afr0
  type cluster/afr
  subvolumes test00.1 test01.2 test02.3
  option replicate *.html:3,*.db:1,*:3
end-volume

volume afr1
  type cluster/afr
  subvolumes test01.1 test02.2 test03.3
  option replicate *.html:3,*.db:1,*:3
end-volume

volume afr2
  type cluster/afr
  subvolumes test02.1 test03.2 test00.3
  option replicate *.html:3,*.db:1,*:3
end-volume

volume afr3
  type cluster/afr
  subvolumes test03.1 test00.2 test01.3
  option replicate *.html:3,*.db:1,*:3
end-volume


volume bricks
  type cluster/unify
  subvolumes afr0 afr1 afr2 afr3
  option readdir-force-success on

  option scheduler alu
  option alu.limits.min-free-disk  60GB
  option alu.limits.max-open-files 10000

  option alu.order disk-usage:read-usage:open-files-usage:write-usage:disk-speed-usage

  option alu.disk-usage.entry-threshold 2GB
  option alu.disk-usage.exit-threshold  60MB
  option alu.open-files-usage.entry-threshold 1024
  option alu.open-files-usage.exit-threshold 32
  option alu.stat-refresh.interval 10sec

  option alu.read-usage.entry-threshold 20%
  option alu.read-usage.exit-threshold 4%
  option alu.write-usage.entry-threshold 20%
  option alu.write-usage.exit-threshold 4%

end-volume
##-- end client config
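
The client is mounted with this spec file in the same way, again with the log level raised to produce the client log above (the /vol/vol0 mount point and the file locations are assumptions based on the paths in the rsync errors):

##-- how the client is mounted (sketch)
glusterfs -f /etc/glusterfs/client.vol \
          -l /var/log/glusterfs/glusterfs-client.log \
          -L DEBUG \
          /vol/vol0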


~Shawn





