[Gluster-devel] odd connection issues under high write load (now read load)

Daniel daniel at datinggold.com
Fri Jun 22 23:42:42 UTC 2007


Same setup, rebuilt with tla 2.4 patch 181 and installed on the 2 
servers and 1 client.

Performance was much better, but now we fail hard instead of soft: the 
mount dies outright and debug mode prints an ugly backtrace. Writes ran 
great, with no failures and no stale mounts, but the read stress test 
reliably kills the mount and crashes the glusterfs client.
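
For reference, here is roughly what the client spec looks like (a 
sketch reconstructed for this mail; hostnames, volume names, and the 
remote subvolume are placeholders):

  volume client1
    type protocol/client
    option transport-type tcp/client
    option remote-host server1        # placeholder hostname
    option remote-subvolume brick     # placeholder server-side volume
  end-volume

  volume client2
    type protocol/client
    option transport-type tcp/client
    option remote-host server2        # placeholder hostname
    option remote-subvolume brick
  end-volume

  volume afr
    type cluster/afr
    subvolumes client1 client2
  end-volume

  # stat-prefetch sits on top, matching the xlator in the backtrace below
  volume prefetch
    type performance/stat-prefetch
    subvolumes afr
  end-volume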


-- the following block repeats about 400-500 times --
[Jun 22 19:23:50] [CRITICAL/client-protocol.c:218/call_bail()] 
client/protocol:bailing transport
[Jun 22 19:23:50] [DEBUG/tcp.c:123/cont_hand()] tcp:forcing 
poll/read/write to break on blocked socket (if any)

[Jun 22 19:23:50] [CRITICAL/common-utils.c:215/gf_print_trace()] 
debug-backtrace:Got signal (11), printing backtrace
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/libglusterfs.so.0(gf_print_trace+0x26) 
[0x6bce1a]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/lib/tls/libc.so.6 [0x2668c8]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/glusterfs/1.3.0-pre3/xlator/cluster/afr.so 
[0x69d4b3]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/glusterfs/1.3.0-pre3/xlator/performance/stat-prefetch.so 
[0x120999]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/libglusterfs.so.0 [0x6bb039]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/libglusterfs.so.0 [0x6bb039]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/glusterfs/1.3.0-pre3/xlator/protocol/client.so 
[0x118a1c]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/glusterfs/1.3.0-pre3/xlator/protocol/client.so 
[0x11adfb]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/libglusterfs.so.0(transport_notify+0x13) 
[0x6bdc5f]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/libglusterfs.so.0(sys_epoll_iteration+0xcf) 
[0x6be2cb]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/usr/local/lib/libglusterfs.so.0(poll_iteration+0x1b) 
[0x6bddf7]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:[glusterfs] [0x804a317]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:/lib/tls/libc.so.6(__libc_start_main+0xd3) [0x253e23]
[Jun 22 19:23:50] [CRITICAL/common-utils.c:217/gf_print_trace()] 
debug-backtrace:[glusterfs] [0x8049dfd]
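
Those raw addresses aren't much use without symbols. If anyone wants a 
readable stack, a core dump loaded into gdb should give one; a minimal 
sketch, assuming the client binary lives at /usr/local/sbin/glusterfs 
(the spec file, log file, and mount point paths are placeholders):

  # allow core dumps in the shell that starts the client
  ulimit -c unlimited

  # run the client with debug logging
  glusterfs -f /etc/glusterfs/client.vol -l /tmp/glusterfs.log -L DEBUG /mnt/glusterfs

  # after the SIGSEGV, load the core file for a symbolized backtrace
  gdb /usr/local/sbin/glusterfs core
  (gdb) bt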

Brent A Nelson wrote:
> I believe you'll find that this works in the tla repository (2.4; 2.5 
> is significantly different code), which has a few patches beyond pre4.
>
> On Thu, 21 Jun 2007, Daniel wrote:
>
>> 1.3.0-pre4
>> afr across 2 servers
>>
>> servers run io-threads; write-behind and read-ahead are disabled
>> TCP on a Gigabit network
>>
>> We set up a PHP stress-test script and ran about 36 instances of it 
>> against the client. Occasionally we get a "transport endpoint not 
>> connected" error, which kills all of the instances (intentionally; 
>> they halt on error, but it means the mount went stale). Without any 
>> intervention, gluster picks up again and seems to operate fine when 
>> we re-run the scripts.
>>
>> we're pushing roughly 300 writes a second in the test
>>
>> the only debug info in the log is the following:
>>
>> [Jun 21 19:33:29] [CRITICAL/client-protocol.c:218/call_bail()] 
>> client/protocol:bailing transport
>> --clipped--
>> [Jun 21 19:33:29] 
>> [ERROR/client-protocol.c:204/client_protocol_xfer()] 
>> protocol/client:transport_submit failed
>>
>> I'm going to set up the debug xlator tomorrow (see the sketch after 
>> this quote) if no one has anything off the top of their head about 
>> what might be wrong
>>
>> We haven't tested heavy read load yet, just writes. We've triggered 
>> the failure multiple times but haven't pinned down a cause, as the 
>> debug logging spits out basically the same material each time.
>>
>> The client also shows fairly high CPU usage during the test, roughly 
>> 90% of the core it's on.
>>
>>
>
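
Re the debug xlator mentioned above: the plan is just to wrap the top 
of the client spec in the trace translator so every fop gets logged on 
its way through. A sketch, assuming the "prefetch" volume from the spec 
sketch earlier in this mail:

  # debug/trace logs every file operation passing through it
  volume trace
    type debug/trace
    subvolumes prefetch
  end-volume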