[Gluster-devel] Race in protocol/client and RPC

Mon Jan 29 22:07:55 UTC 2018

Hi all,

I've identified a race in the RPC layer that causes some spurious
disconnections and CHILD_DOWN notifications.

The problem happens when protocol/client reconfigures a connection to move
from glusterd to glusterfsd. This is done by calling rpc_clnt_reconfig()
followed by rpc_transport_disconnect().

This seems fine because client_rpc_notify() will call
rpc_clnt_cleanup_and_start() when the disconnect notification is received.
However, there's a problem.

Suppose that the disconnection notification has been processed and we are
just about to call rpc_clnt_cleanup_and_start(). If the reconnection timer
fires at this point, rpc_clnt_reconnect() will run. This will cause the
socket to be reconnected and a connection notification to be processed.
Then a handshake request will be sent to the server.

However, when rpc_clnt_cleanup_and_start() continues, the XIDs of all sent
requests are deleted, including the one for the handshake. When we receive
the answer to the handshake, we are unable to map its XID, so the request
fails. The handshake therefore fails and the client is considered down,
sending a CHILD_DOWN notification to upper xlators.
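
To make the ordering problem concrete, here is a minimal single-threaded
sketch of the two interleavings. It is not GlusterFS code: the toy_conn
struct and all toy_* functions are hypothetical names that only model a
table of in-flight XIDs, under the assumption that cleanup simply forgets
every pending request.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8

/* Hypothetical toy model: a connection is just its table of in-flight XIDs. */
struct toy_conn {
    uint32_t pending[MAX_PENDING];
    size_t   npending;
    uint32_t next_xid;
};

/* Send a request: remember its XID so the reply can be matched later. */
uint32_t toy_send(struct toy_conn *c)
{
    uint32_t xid = c->next_xid++;
    if (c->npending < MAX_PENDING)
        c->pending[c->npending++] = xid;
    return xid;
}

/* Cleanup: forget every in-flight XID (assumed behaviour, see lead-in). */
void toy_cleanup(struct toy_conn *c)
{
    c->npending = 0;
}

/* Reply: returns true only if the XID is still known; removes it if so. */
bool toy_reply(struct toy_conn *c, uint32_t xid)
{
    for (size_t i = 0; i < c->npending; i++) {
        if (c->pending[i] == xid) {
            c->pending[i] = c->pending[--c->npending];
            return true;
        }
    }
    return false;
}

/* Racy order: the timer path sends the handshake before cleanup finishes. */
bool toy_racy_order(void)
{
    struct toy_conn c = {0};
    uint32_t xid = toy_send(&c);  /* handshake sent after early reconnect */
    toy_cleanup(&c);              /* cleanup resumes and drops the XID */
    return toy_reply(&c, xid);    /* reply can no longer be mapped */
}

/* Intended order: cleanup completes first, then the handshake is sent. */
bool toy_expected_order(void)
{
    struct toy_conn c = {0};
    toy_cleanup(&c);
    uint32_t xid = toy_send(&c);
    return toy_reply(&c, xid);
}
```

In this model toy_racy_order() reports a failed handshake reply, while
toy_expected_order() matches the reply. The real race involves threads and
timers, which this sketch deliberately leaves out.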

As a result, some tests start processing operations while a brick is
unexpectedly down, which causes spurious test failures.

To solve the problem, I've made rpc_clnt_reconfig() disable the RPC
connection, using code similar to rpc_clnt_disable(). This prevents the
background rpc_clnt_reconnect() timer from being executed, avoiding the
problem.
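
The idea behind the fix can be sketched as a flag that gates the reconnect
timer. Again, this is not the actual patch: toy_rpc and the toy_* functions
are hypothetical, and the assumption that the cleanup-and-start step is what
re-enables the connection is mine.

```c
#include <stdbool.h>

/* Hypothetical toy model of the connection state relevant to the fix. */
struct toy_rpc {
    bool disabled;    /* set while the connection is being reconfigured */
    bool connected;
    int  reconnects;  /* number of times the timer really reconnected */
};

/* Reconfigure: mark the connection disabled before dropping the transport. */
void toy_reconfig(struct toy_rpc *r)
{
    r->disabled = true;
    r->connected = false;
}

/* Reconnect timer callback: does nothing while the connection is disabled. */
bool toy_reconnect_timer(struct toy_rpc *r)
{
    if (r->disabled || r->connected)
        return false;
    r->connected = true;
    r->reconnects++;
    return true;
}

/* Cleanup and start: assumed to re-enable the connection when it is done. */
void toy_cleanup_and_start(struct toy_rpc *r)
{
    r->disabled = false;
}
```

With this gating, a timer that fires between toy_reconfig() and
toy_cleanup_and_start() is a no-op, so no handshake can be sent before the
pending requests have been cleared.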

This seems to work fine for many tests, but it seems to be causing some
issues in gfapi-based tests. I'm still investigating this.

Xavi