[Gluster-devel] glusterfs client problem
Krishna Srinivas
krishna at zresearch.com
Wed Apr 4 16:12:03 UTC 2007
Hi Shawn,
We committed a fix today that may address your problem; can
you check with the latest source?
I tried running two instances of rsync, but the problem did not
reproduce. If you still see it, can you give more detailed
steps to reproduce?
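If you are building from source, the usual autotools steps should
work (the prefix here just matches the install path visible in your
backtrace, adjust as needed):

./configure --prefix=/usr/local/glusterfs-mainline
make && make install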
Thanks
Krishna
On 4/4/07, Shawn Northart <shawn at jupiterhosting.com> wrote:
> sorry, forgot that one. the command used was:
> rsync -av --stats --progress --delete
>
> i haven't tried setting a --bwlimit yet, and i'd prefer not to if
> possible. i've got roughly 450GB of data to sync over, and the faster
> i can do it, the better. i will try it just to see if it makes things
> any better. the network is all copper gig, with both interfaces
> trunked and vlan'd (on both client and server).
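> if i do end up trying a cap, it would be something along these lines
> (--bwlimit is in KB/s; the number and the SRC/DEST paths below are
> placeholders, not our real values):
>
> rsync -av --stats --progress --delete --bwlimit=25000 SRC/ DEST/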
> a couple of other things just came to mind: i didn't see this exact
> behavior during the initial rsync. i have three directories i'm
> trying to sync; when run concurrently, i see the problem, but when
> run one at a time, the sync completes without incident. the only
> difference in the command for the sequential runs was that i omitted
> the --delete flag.
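> roughly, a serialized run like the following completes fine (DIR1-DIR3
> and the mount point stand in for our real paths):
>
> rsync -av --stats --progress DIR1/ /mnt/glusterfs/DIR1/ && \
> rsync -av --stats --progress DIR2/ /mnt/glusterfs/DIR2/ && \
> rsync -av --stats --progress DIR3/ /mnt/glusterfs/DIR3/
>
> whereas launching all three at once (one & per command) is what
> triggers the failure.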
>
> ~Shawn
>
> On Tue, 2007-04-03 at 11:07 +0530, Krishna Srinivas wrote:
> > Hi Shawn,
> >
> > Can you give us the exact rsync command you used?
> >
> > Thanks
> > Krishna
> >
> > On 4/3/07, Shawn Northart <shawn at jupiterhosting.com> wrote:
> > > I'm noticing a problem with our test setup with regard to (reasonably)
> > > heavy read/write usage.
> > > the problem we're having is that during an rsync of content, the sync
> > > bails due to the mount being lost with the following errors:
> > >
> > > <snip>
> > > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trailers" failed:
> > > Transport endpoint is not connected (107)
> > > rsync: recv_generator: mkdir
> > > "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember" failed: Transport
> > > endpoint is not connected (107)
> > > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember" failed:
> > > Transport endpoint is not connected (107)
> > > rsync: recv_generator: mkdir
> > > "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/bardoux" failed:
> > > Transport endpoint is not connected (107)
> > > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/bardoux"
> > > failed: Transport endpoint is not connected (107)
> > > rsync: recv_generator: mkdir
> > > "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/images" failed:
> > > Transport endpoint is not connected (107)
> > > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/trialmember/images"
> > > failed: Transport endpoint is not connected (107)
> > > rsync: recv_generator: mkdir
> > > "/vol/vol0/sites/TESTSITE.com/htdocs/upgrade_trailers" failed: Transport
> > > endpoint is not connected (107)
> > > rsync: stat "/vol/vol0/sites/TESTSITE.com/htdocs/upgrade_trailers"
> > > failed: Transport endpoint is not connected (107)
> > > </snip>
> > >
> > > normal logging shows nothing on either the client or server side,
> > > but running in DEBUG mode shows the following at the end of the
> > > client log right as it breaks:
> > >
> > > <snip>
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:213/gf_print_trace()]
> > > debug-backtrace:Got signal (11), printing backtrace
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/usr/local/glusterfs-mainline/lib/libglusterfs.so.0(gf_print_trace+0x1f) [0x2a9556030f]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/lib64/tls/libc.so.6 [0x35b992e2b0]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/lib64/tls/libpthread.so.0(__pthread_mutex_destroy+0)
> > > [0x35ba807ab0]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/cluster/afr.so [0x2a958b840c]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/protocol/client.so [0x2a957b06c2]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/usr/local/glusterfs-mainline/lib/glusterfs/1.3.0-pre2.2/xlator/protocol/client.so [0x2a957b3196]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/usr/local/glusterfs-mainline/lib/libglusterfs.so.0(epoll_iteration+0xf8) [0x2a955616f8]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:[glusterfs] [0x4031b7]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:/lib64/tls/libc.so.6(__libc_start_main+0xdb)
> > > [0x35b991c3fb]
> > > [Apr 02 13:25:11] [DEBUG/common-utils.c:215/gf_print_trace()]
> > > debug-backtrace:[glusterfs] [0x402bba]
> > > </snip>
> > >
> > >
> > > the server log shows the following at the time it breaks:
> > > <snip>
> > > [Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
> > > libglusterfs:full_rw: 0 bytes r/w instead of 113
> > > [Apr 02 15:30:09]
> > > [DEBUG/protocol.c:244/gf_block_unserialize_transport()]
> > > libglusterfs/protocol:gf_block_unserialize_transport: full_read of
> > > header failed
> > > [Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
> > > protocol/server:cleaned up xl_private of 0x510470
> > > [Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
> > > tcp/server:destroying transport object for 192.168.0.96:1012 (fd=8)
> > > [Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
> > > libglusterfs:full_rw: 0 bytes r/w instead of 113
> > > [Apr 02 15:30:09]
> > > [DEBUG/protocol.c:244/gf_block_unserialize_transport()]
> > > libglusterfs/protocol:gf_block_unserialize_transport: full_read of
> > > header failed
> > > [Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
> > > protocol/server:cleaned up xl_private of 0x510160
> > > [Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
> > > tcp/server:destroying transport object for 192.168.0.96:1013 (fd=7)
> > > [Apr 02 15:30:09] [ERROR/common-utils.c:54/full_rw()]
> > > libglusterfs:full_rw: 0 bytes r/w instead of 113
> > > [Apr 02 15:30:09]
> > > [DEBUG/protocol.c:244/gf_block_unserialize_transport()]
> > > libglusterfs/protocol:gf_block_unserialize_transport: full_read of
> > > header failed
> > > [Apr 02 15:30:09] [DEBUG/proto-srv.c:2868/proto_srv_cleanup()]
> > > protocol/server:cleaned up xl_private of 0x502300
> > > [Apr 02 15:30:09] [DEBUG/tcp-server.c:243/gf_transport_fini()]
> > > tcp/server:destroying transport object for 192.168.0.96:1014 (fd=4)
> > > </snip>
> > >
> > > we're using 4 bricks in this setup and, for the moment, just one
> > > client (we'd like to scale to 20-30 clients and 4-8 server bricks).
> > > the same behavior is observed with or without any combination of
> > > the performance translators, and with or without file replication.
> > > the alu, random, and round-robin schedulers were all tried in our
> > > testing.
> > > the systems in question run CentOS 4.4. these logs are from our
> > > 64-bit systems, but we have seen exactly the same thing on the
> > > 32-bit ones as well.
> > > glusterfs looks like it could be a good fit for some of the
> > > high-traffic domains we host, but unless we can resolve this issue,
> > > we'll have to continue using NFS.
> > >
> > >
> > > our current server-side (brick) config consists of the following:
> > > ##-- begin server config
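> > > # one storage/posix volume per backend directory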
> > > volume vol1
> > > type storage/posix
> > > option directory /vol/vol1/gfs
> > > end-volume
> > >
> > > volume vol2
> > > type storage/posix
> > > option directory /vol/vol2/gfs
> > > end-volume
> > >
> > > volume vol3
> > > type storage/posix
> > > option directory /vol/vol3/gfs
> > > end-volume
> > >
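> > > # wrap each posix volume with io-threads (8 threads apiece)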
> > > volume brick1
> > > type performance/io-threads
> > > option thread-count 8
> > > subvolumes vol1
> > > end-volume
> > >
> > > volume brick2
> > > type performance/io-threads
> > > option thread-count 8
> > > subvolumes vol2
> > > end-volume
> > >
> > > volume brick3
> > > type performance/io-threads
> > > option thread-count 8
> > > subvolumes vol3
> > > end-volume
> > >
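> > > # export the three bricks over tcp; only 192.168.0.* may connect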
> > > volume server
> > > type protocol/server
> > > option transport-type tcp/server
> > > option bind-address 10.88.188.91
> > > subvolumes brick1 brick2 brick3
> > > option auth.ip.brick1.allow 192.168.0.*
> > > option auth.ip.brick2.allow 192.168.0.*
> > > option auth.ip.brick3.allow 192.168.0.*
> > > end-volume
> > > ##-- end server config
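> > > for reference, the server side is started against that spec file
> > > with something like this (the spec path is just where we keep it):
> > >
> > > glusterfsd -f /etc/glusterfs/server.vol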
> > >
> > >
> > > our client config is as follows:
> > >
> > > ##-- begin client config
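> > > # twelve protocol/client volumes: one per (server, brick) pair
> > > # across the four test servers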
> > > volume test00.1
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.91
> > > option remote-subvolume brick1
> > > end-volume
> > > volume test00.2
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.91
> > > option remote-subvolume brick2
> > > end-volume
> > > volume test00.3
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.91
> > > option remote-subvolume brick3
> > > end-volume
> > >
> > >
> > > volume test01.1
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.92
> > > option remote-subvolume brick1
> > > end-volume
> > > volume test01.2
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.92
> > > option remote-subvolume brick2
> > > end-volume
> > > volume test01.3
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.92
> > > option remote-subvolume brick3
> > > end-volume
> > >
> > >
> > > volume test02.1
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.93
> > > option remote-subvolume brick1
> > > end-volume
> > > volume test02.2
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.93
> > > option remote-subvolume brick2
> > > end-volume
> > > volume test02.3
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.93
> > > option remote-subvolume brick3
> > > end-volume
> > >
> > >
> > > volume test03.1
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.94
> > > option remote-subvolume brick1
> > > end-volume
> > > volume test03.2
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.94
> > > option remote-subvolume brick2
> > > end-volume
> > > volume test03.3
> > > type protocol/client
> > > option transport-type tcp/client
> > > option remote-host 192.168.0.94
> > > option remote-subvolume brick3
> > > end-volume
> > >
> > >
> > >
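> > > # four afr groups, each replicating across bricks on three
> > > # different servers; the replicate pattern keeps *.db files
> > > # single-copy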
> > > volume afr0
> > > type cluster/afr
> > > subvolumes test00.1 test01.2 test02.3
> > > option replicate *.html:3,*.db:1,*:3
> > > end-volume
> > >
> > > volume afr1
> > > type cluster/afr
> > > subvolumes test01.1 test02.2 test03.3
> > > option replicate *.html:3,*.db:1,*:3
> > > end-volume
> > >
> > > volume afr2
> > > type cluster/afr
> > > subvolumes test02.1 test03.2 test00.3
> > > option replicate *.html:3,*.db:1,*:3
> > > end-volume
> > >
> > > volume afr3
> > > type cluster/afr
> > > subvolumes test03.1 test00.2 test01.3
> > > option replicate *.html:3,*.db:1,*:3
> > > end-volume
> > >
> > >
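> > > # unify the four afr groups into a single namespace via the alu
> > > # scheduler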
> > > volume bricks
> > > type cluster/unify
> > > subvolumes afr0 afr1 afr2 afr3
> > > option readdir-force-success on
> > >
> > > option scheduler alu
> > > option alu.limits.min-free-disk 60GB
> > > option alu.limits.max-open-files 10000
> > >
> > > option alu.order
> > > disk-usage:read-usage:open-files-usage:write-usage:disk-speed-usage
> > >
> > > option alu.disk-usage.entry-threshold 2GB
> > > option alu.disk-usage.exit-threshold 60MB
> > > option alu.open-files-usage.entry-threshold 1024
> > > option alu.open-files-usage.exit-threshold 32
> > > option alu.stat-refresh.interval 10sec
> > >
> > > option alu.read-usage.entry-threshold 20%
> > > option alu.read-usage.exit-threshold 4%
> > > option alu.write-usage.entry-threshold 20%
> > > option alu.write-usage.exit-threshold 4%
> > >
> > > end-volume
> > > ##-- end client config
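> > > and the client mounts that spec with something like this (mount
> > > point is just an example):
> > >
> > > glusterfs -f /etc/glusterfs/client.vol /mnt/glusterfs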
> > >
> > >
> > > ~Shawn
> > >
> > >
> > >
> > >
> >
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>