[Gluster-devel] Race in protocol/client and RPC
Xavi Hernandez
jahernan at redhat.com
Thu Feb 1 14:55:27 UTC 2018
On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <srangana at redhat.com>
wrote:
> On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> > After having tried several things, it seems that solving these races
> > will be complex. All my attempts to fix them have caused failures in
> > other connections. Since I have other work to do and this doesn't seem
> > to be causing serious failures in production, I'll leave it for now and
> > come back to it when I have more time.
>
> Xavi, could you convert the findings into a bug and post the details
> there, so that they can be followed up? (if not already done)
>
I've just created this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1541032
> >
> > Xavi
> >
> > On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jahernan at redhat.com>
> > wrote:
> >
> > Hi all,
> >
> > I've identified a race in the RPC layer that causes some spurious
> > disconnections and CHILD_DOWN notifications.
> >
> > The problem happens when protocol/client reconfigures a connection
> > to move from glusterd to glusterfsd. This is done by calling
> > rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> >
> > This seems fine because client_rpc_notify() will call
> > rpc_clnt_cleanup_and_start() when the disconnect notification is
> > received. However, there's a problem.
> >
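> > To make it concrete, the sequence looks roughly like this (a simplified
> > sketch only; the surrounding function, struct fields and exact argument
> > lists are approximate, only rpc_clnt_reconfig() and
> > rpc_transport_disconnect() are the real entry points):
> >
> >     /* protocol/client (sketch): switch the connection from the
> >      * glusterd port to the brick (glusterfsd) port. */
> >     static void
> >     client_switch_to_brick(struct rpc_clnt *rpc, int brick_port)
> >     {
> >         struct rpc_clnt_config config = { 0, };
> >
> >         config.remote_port = brick_port;
> >
> >         /* 1. Store the new destination in the rpc_clnt. */
> >         rpc_clnt_reconfig(rpc, &config);
> >
> >         /* 2. Drop the current (glusterd) connection. The DISCONNECT
> >          * notification makes client_rpc_notify() call
> >          * rpc_clnt_cleanup_and_start(), which should restart the
> >          * connection towards glusterfsd. */
> >         rpc_transport_disconnect(rpc->conn.trans, _gf_false);
> >     }
> >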
> > Suppose that the disconnect notification has been executed and we are
> > just about to call rpc_clnt_cleanup_and_start(). If the reconnection
> > timer fires at this point, rpc_clnt_reconnect() will be processed. This
> > causes the socket to be reconnected and a connection notification to be
> > processed. Then a handshake request is sent to the server.
> >
> > However, when rpc_clnt_cleanup_and_start() finally runs, all pending
> > XIDs are deleted. When the reply to the handshake arrives, we are unable
> > to map its XID, so the request fails. The handshake therefore fails and
> > the client is considered down, sending a CHILD_DOWN notification to the
> > upper xlators.
> >
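> > The effect can be modelled with a tiny standalone program (this is not
> > GlusterFS code, just an abstract model of the interleaving; the XID
> > table stands for the saved frames of the connection):
> >
> >     #include <stdbool.h>
> >     #include <stdio.h>
> >
> >     #define MAX_XID 16
> >
> >     /* Models the table of in-flight requests, keyed by XID. */
> >     static bool pending[MAX_XID];
> >
> >     /* Models the reconnect path submitting the handshake request. */
> >     static int send_handshake(int xid)
> >     {
> >         pending[xid] = true;
> >         printf("handshake sent, xid=%d\n", xid);
> >         return xid;
> >     }
> >
> >     /* Models rpc_clnt_cleanup_and_start() forgetting every sent XID. */
> >     static void cleanup_pending(void)
> >     {
> >         for (int i = 0; i < MAX_XID; i++)
> >             pending[i] = false;
> >         printf("cleanup: all pending XIDs dropped\n");
> >     }
> >
> >     /* Models the reply arriving from the server. */
> >     static void receive_reply(int xid)
> >     {
> >         if (pending[xid])
> >             printf("reply xid=%d matched, handshake succeeds\n", xid);
> >         else
> >             printf("reply xid=%d unknown -> handshake fails, CHILD_DOWN\n",
> >                    xid);
> >     }
> >
> >     int main(void)
> >     {
> >         /* The disconnect notification has already run, but the delayed
> >          * cleanup has not been executed yet when the timer fires: */
> >         int xid = send_handshake(1); /* rpc_clnt_reconnect() path */
> >         cleanup_pending();           /* delayed cleanup runs late */
> >         receive_reply(xid);          /* reply can't be matched    */
> >         return 0;
> >     }
> >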
> > In some tests this causes processing to start while a brick is
> > unexpectedly down, leading to spurious test failures.
> >
> > To solve the problem I've made rpc_clnt_reconfig() disable the RPC
> > connection, using code similar to rpc_clnt_disable(). This prevents the
> > background rpc_clnt_reconnect() timer from being executed, avoiding the
> > problem.
> >
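> > Roughly, the change looks like this (a sketch only; the field and
> > helper names below are my approximation of what rpc_clnt_disable()
> > does, the real code is a bit more involved):
> >
> >     /* At the end of rpc_clnt_reconfig() (sketch): make sure the
> >      * reconnect timer cannot fire between the disconnect
> >      * notification and rpc_clnt_cleanup_and_start(). */
> >     pthread_mutex_lock(&rpc->conn.lock);
> >     {
> >         rpc->disabled = 1;              /* flag rpc_clnt_disable() sets */
> >         if (rpc->conn.reconnect) {
> >             gf_timer_call_cancel(rpc->ctx, rpc->conn.reconnect);
> >             rpc->conn.reconnect = NULL; /* no pending rpc_clnt_reconnect() */
> >         }
> >     }
> >     pthread_mutex_unlock(&rpc->conn.lock);
> >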
> > This seems to work fine for many tests, but it appears to cause some
> > issues in gfapi-based tests. I'm still investigating this.
> >
> > Xavi
> >
>