[Gluster-devel] 3.4.0alpha3 spurious disconnect
Emmanuel Dreyfus
manu at netbsd.org
Thu Apr 18 06:08:50 UTC 2013
I still get spurious disconnects in 3.4.0alpha3. While I am at it, I note
that this patch has not been pulled up to the 3.4 branch, although it
fixes a problem I encountered on alpha2:
http://review.gluster.com/#/c/4588/
Here is the first occurrence of a spurious disconnect on the client side
(I added debug messages):
[2013-04-17 21:07:47.198612] E [socket.c:487:__socket_rwv]
0-gfs33-client-2: EOF on socket (errno = 0, opcount = 1,
opvector[0].iov_len = 4
[2013-04-17 21:07:47.198824] W [socket.c:515:__socket_rwv]
0-gfs33-client-2: readv failed (No message available)
[2013-04-17 21:07:47.198947] W
[socket.c:1963:__socket_proto_state_machine] 0-gfs33-client-2:
reading from socket failed. Error (No message available), peer
(192.0.2.103:49153)
[2013-04-17 21:07:47.199000] I [client.c:2097:client_rpc_notify]
0-gfs33-client-2: disconnected
[2013-04-17 21:07:47.266289] W
[client-rpc-fops.c:1640:client3_3_entrylk_cbk] 0-gfs33-client-2:
remote operation failed: Socket is not connected
In socket.c, EOF is declared because ret is 0. ret may come from
iov_load() or from readv(). I have not yet determined which one is the
culprit.
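For reference, here is a minimal sketch of the plain POSIX behaviour in question, written in Python rather than taken from socket.c. It uses a local socket pair and a 4-byte buffer (matching iov_len = 4 in the log above); the helper name readv_once is hypothetical. It only shows that readv() returns 0 once the peer has closed its end, which a caller would treat as EOF. It says nothing about what iov_load() does.

```python
import os
import socket


def readv_once(sock, nbytes):
    """Issue a single readv() with one buffer of nbytes (hypothetical helper)."""
    buf = bytearray(nbytes)
    n = os.readv(sock.fileno(), [buf])  # number of bytes read; 0 means EOF
    return n, bytes(buf[:n])


# A connected pair of local stream sockets stands in for client and brick.
a, b = socket.socketpair()

# Case 1: the peer has sent 4 bytes, so readv() returns 4.
b.sendall(b"\x00\x00\x00\x2a")
with_data = readv_once(a, 4)

# Case 2: the peer has closed its end, so readv() returns 0 (EOF).
b.close()
after_close = readv_once(a, 4)

a.close()

print("with data:  ", with_data)
print("after close:", after_close)
```

A return value of 0 from readv() itself therefore points at the peer having closed the connection before the read, which is worth checking against the packet capture.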
On the brick side, I get this:
[2013-04-17 21:07:47.208168] E
[event-poll.c:346:event_dispatch_poll_handler] 0-poll: index
not found for fd=8 (idx_hint=5)
A tcpdump running at that time on the brick side reports a TCP RST at
22:07:47.208163. I recall that glusterfs does not use local time in its
logs, so I think this corresponds to 21:07:47.208163 for glusterfs.
There is also a small clock skew between the client (offset -0.000732)
and the brick (offset -0.006740), which means the brick is 6008 µs behind
the client. As I understand it, that means the TCP reset happens after
ret = 0 is seen at socket.c:487. I therefore strongly suspect iov_load().
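The timing argument can be checked with a short sketch. It assumes the sign convention used above (the brick clock reads 6008 µs less than the client clock at the same instant) and that the tcpdump timestamp is one hour ahead of the glusterfs log time. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Clock offsets quoted above, in microseconds.
client_offset_us = -732
brick_offset_us = -6740

# Assumption: the brick is behind the client by the difference of the offsets.
skew_us = client_offset_us - brick_offset_us

# "EOF on socket" as logged on the client (client clock).
eof_client = datetime(2013, 4, 17, 21, 7, 47, 198612)

# TCP RST seen by tcpdump on the brick, shifted to glusterfs log time
# (brick clock).
rst_brick = datetime(2013, 4, 17, 21, 7, 47, 208163)

# Express the RST on the client clock by adding the skew.
rst_client = rst_brick + timedelta(microseconds=skew_us)

# Positive value: the RST happened after the EOF was detected.
rst_after_eof_us = (rst_client - eof_client) // timedelta(microseconds=1)

print("skew (us):", skew_us)
print("RST on client clock:", rst_client.time())
print("RST after EOF (us):", rst_after_eof_us)
```

Under these assumptions the RST falls roughly 15.6 ms after the EOF is logged on the client. Even without the skew correction it would still be about 9.6 ms later.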
Opinions? Any hint?
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org