[Gluster-devel] (client) reconnect attempts cause load spike

Anand Avati avati at zresearch.com
Mon Jul 30 21:16:50 UTC 2007


Rhesa,
the random bail+reconnection bug has been fixed. please let us know in case
you come across any issues.

thanks,
avati

2007/7/8, Rhesa Rozendaal <gluster at rhesa.com>:
>
> I just came back from dinner to find the client (patch-284, same spec file
> as
> before) pegging the machine:
>
> 13268 root      16   0  601m 527m  912 S  100 26.1 174:05.69 [glusterfs]
>
> (the load on the machine was 35, so the 100% cpu doesn't really do it
> justice)
>
> Here's the relevant portion of the server log:
>
> 2007-07-07 21:03:45 E [protocol.c:262:gf_block_unserialize_transport]
> libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
> 2007-07-07 21:03:45 C [tcp.c:81:tcp_disconnect] server: connection
> disconnected
> 2007-07-07 21:05:45 E [protocol.c:262:gf_block_unserialize_transport]
> libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
> 2007-07-07 21:05:45 C [tcp.c:81:tcp_disconnect] server: connection
> disconnected
> 2007-07-07 21:09:45 E [protocol.c:262:gf_block_unserialize_transport]
> libglusterfs/protocol: full_read of header failed: peer (192.168.64.186)
> 2007-07-07 21:09:45 C [tcp.c:81:tcp_disconnect] server: connection
> disconnected
>
> (it actually started showing those from 20:33, repeating the same block
> once
> every 4 minutes or so).
>
> The client log is probably more interesting:
>
> 2007-07-07 21:09:47 W [client-protocol.c:207:call_bail] client31:
> activating
> bail-out. pending frames = 1. last sent = 2007-07-07 21:05:47. last
> received =
> 1970-01-01 00:00:00transport-timeout = 120
> 2007-07-07 21:09:47 C [client-protocol.c:215:call_bail] client31: bailing
> transport
> 2007-07-07 21:09:47 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write
> to
> break on blocked socket (if any)
> 2007-07-07 21:09:47 W [client-protocol.c:4043:client_protocol_cleanup]
> client31: cleaning up state in transport object 0x8690028
> 2007-07-07 21:09:47 W [client-protocol.c:4092:client_protocol_cleanup]
> client31: forced unwinding frame type(1) op(2)
> 2007-07-07 21:09:47 C [tcp.c:81:tcp_disconnect] client31: connection
> disconnected
> 2007-07-07 21:09:47 E [client-protocol.c:328:client_protocol_xfer]
> client31:
> transport_submit failed
> 2007-07-07 21:09:47 W [client-protocol.c:4012:client_protocol_reconnect]
> client31: attempting reconnect
> 2007-07-07 21:09:47 D [tcp-client.c:71:tcp_connect] client31: socket fd =
> 4
> 2007-07-07 21:09:47 D [tcp-client.c:89:tcp_connect] client31: finalized on
> port `1018'
> 2007-07-07 21:09:47 D [tcp-client.c:110:tcp_connect] client31: defaulting
> remote-port to 6996
> 2007-07-07 21:09:47 D [tcp-client.c:142:tcp_connect] client31: connect on
> 4 in
> progress (non-blocking)
> 2007-07-07 21:09:47 D [tcp-client.c:186:tcp_connect] client31: connection
> on 4
> success
> 2007-07-07 21:09:47 D [client-protocol.c:4578:notify] client31: got
> GF_EVENT_CHILD_UP
> 2007-07-07 21:09:47 D [client-protocol.c
> :4341:client_protocol_handshake_reply]
> client31: reply frame has callid: 424242
> 2007-07-07 21:09:47 D [client-protocol.c
> :4375:client_protocol_handshake_reply]
> client31: SETVOLUME on remote-host succeeded
> 2007-07-07 21:09:48 W [client-protocol.c:4020:client_protocol_reconnect]
> client31: breaking reconnect chain
>
> That block repeats 3 times (with 4 minute intervals), after which I see
> about
> 600 lines like these:
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
> inode(221481587)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
> inode(114199670)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
> inode(221481586)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] fuse/inode: destroy
> inode(221481578)
>
> [snip 100 similar lines]
>
> 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client01:
> forget
> inode (221284204)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client01/inode:
> destroy
> inode(221284204)
> 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client03:
> forget
> inode (355043120)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client03/inode:
> destroy
> inode(355043120)
> 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client02:
> forget
> inode (84248694)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client02/inode:
> destroy
> inode(84248694)
> 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] client31:
> forget
> inode (92176667)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] client31/inode:
> destroy
> inode(92176667)
> 2007-07-07 21:10:02 D [client-protocol.c:1829:client_forget] ns: forget
> inode
> (221284203)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] ns/inode: destroy
> inode(221284203)
> 2007-07-07 21:10:02 D [inode.c:236:__destroy_inode] export/inode: destroy
> inode(221284203)
>
> [snip 500 similar lines]
>
> Note that those were all in the same second.
>
> Then 3 minutes silence, after which the "activating bail-out" sequence
> starts
> again, all the way through the forget/destroy inode block.
>
> By now, I've killed the client, and will restart the "cp -ru" process I
> was
> running. I'll report back if anything happens.
>
> I doubt this is enough info for you to figure this out, so let me know if
> there's anything I can do to help pinpoint this odd reconnection behavior.
>
> Rhesa
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>



-- 
Anand V. Avati



More information about the Gluster-devel mailing list