[Bugs] [Bug 1541032] New: Races in network communications

bugzilla at redhat.com bugzilla at redhat.com
Thu Feb 1 14:53:56 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1541032

            Bug ID: 1541032
           Summary: Races in network communications
           Product: GlusterFS
           Version: mainline
         Component: rpc
          Assignee: bugs at gluster.org
          Reporter: jahernan at redhat.com
                CC: bugs at gluster.org



Description of problem:

Several races exist in RPC communications.

* When rpc_clnt_reconfig() is called to change the port (to switch from
glusterd to glusterfsd in protocol/client), a disconnection could be received
and a reconnect attempted before protocol/client calls
rpc_transport_disconnect(). This creates a spurious connection that will be
closed shortly afterwards, but it lives long enough to send a handshake
request, which fails when the socket is closed again and triggers a
CHILD_DOWN event. As a result, some volumes come online with fewer bricks
than expected, which can cause unnecessary damage that self-heal will need to
handle.

* After calling rpc_clnt_reconfig() and rpc_transport_disconnect(), an
rpc_clnt_notify() will be received to disconnect the client from glusterd. If
the reconnect timer fires before rpc_clnt_cleanup_and_start() is called, an
issue similar to the previous one occurs.

* When rpc_clnt_notify() is called with RPC_TRANSPORT_CLEANUP,
rpc_clnt_destroy() is immediately called, but there could still be some timer
callbacks running and using resources from the connection.

* In rpc_clnt_remove_ping_timer_locked(), it can happen that the timer is
armed and has just been triggered for execution, but the callback is not yet
running. In this case the function still returns 1, causing a call to
rpc_clnt_unref(). If that was the last reference, the timer callback will
access already destroyed resources when it runs (see the sketch after this
list).
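
To illustrate the last race, here is a minimal sketch of one common way to
avoid it (plain pthreads, hypothetical conn_t/conn_ref/conn_unref names, not
actual GlusterFS code): the timer owns its own reference on the connection
and only the callback releases it, so removing the timer can never drop the
last reference while the callback is still pending.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct conn {
    pthread_mutex_t lock;
    int             refcount;
} conn_t;

static conn_t *conn_new(void)
{
    conn_t *c = calloc(1, sizeof(*c));
    pthread_mutex_init(&c->lock, NULL);
    c->refcount = 1;               /* caller's reference */
    return c;
}

static void conn_ref(conn_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->refcount++;
    pthread_mutex_unlock(&c->lock);
}

static void conn_unref(conn_t *c)
{
    int last;

    pthread_mutex_lock(&c->lock);
    last = (--c->refcount == 0);
    pthread_mutex_unlock(&c->lock);

    if (last) {
        printf("connection destroyed\n");
        pthread_mutex_destroy(&c->lock);
        free(c);
    }
}

/* Timer callback: owns one reference, taken when the timer was armed. */
static void *ping_timer_cb(void *arg)
{
    conn_t *c = arg;

    sleep(1);                      /* the timeout has already expired */
    printf("ping timer callback ran on a still-valid connection\n");
    conn_unref(c);                 /* release the timer's own reference */
    return NULL;
}

int main(void)
{
    conn_t *c = conn_new();
    pthread_t timer;

    conn_ref(c);                   /* extra reference owned by the timer */
    pthread_create(&timer, NULL, ping_timer_cb, c);

    /*
     * The "remove timer" path runs after the callback has already been
     * scheduled; it only drops the caller's reference, so the memory
     * stays valid until the callback releases its own reference.
     */
    conn_unref(c);

    pthread_join(timer, NULL);
    return 0;
}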

Version-Release number of selected component (if applicable): mainline


How reproducible:

Difficult; the races depend on thread timing, so there is no deterministic
reproducer.

Steps to Reproduce:
1.
2.
3.

Actual results:

Occasionally a spurious CHILD_DOWN is triggered and volumes come online with
fewer bricks than expected; timer callbacks can also run against already
destroyed connection resources.


Expected results:

All bricks come online cleanly after the port switch, and no callback uses
connection resources after they have been destroyed.


Additional info:

IMO, one thing that makes it very hard to make this work as expected, and to
keep good control of what is happening, is that we allow high-level clients
to call rpc_transport_disconnect() and other lower-level functions directly,
bypassing rpc-clnt. Considering that multiple threads can access the same
connection in different ways, it's difficult to coordinate all accesses
correctly.
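
As a rough illustration of what I mean by coordinating all accesses through
rpc-clnt (hypothetical names, not a proposed patch): if every state change
goes through one lock-protected state machine, a reconnect attempt that
races with a reconfig/disconnect is simply rejected.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    CONN_CONNECTED,
    CONN_DISCONNECTING,
    CONN_DISCONNECTED,
    CONN_RECONNECT_PENDING,
} conn_state_t;

typedef struct rpc_conn {
    pthread_mutex_t lock;
    conn_state_t    state;
    int             port;
} rpc_conn_t;

/* Reconfigure the port and request a disconnect as one atomic step. */
static bool rpc_conn_reconfig_and_disconnect(rpc_conn_t *conn, int new_port)
{
    bool do_disconnect = false;

    pthread_mutex_lock(&conn->lock);
    conn->port = new_port;
    if (conn->state == CONN_CONNECTED) {
        conn->state = CONN_DISCONNECTING;
        do_disconnect = true;      /* this thread owns the transition */
    }
    pthread_mutex_unlock(&conn->lock);

    /* Only the thread that won the transition touches the transport. */
    return do_disconnect;
}

/* Reconnect timer path: bail out if a disconnect is still in progress. */
static bool rpc_conn_try_reconnect(rpc_conn_t *conn)
{
    bool do_reconnect = false;

    pthread_mutex_lock(&conn->lock);
    if (conn->state == CONN_DISCONNECTED) {
        conn->state = CONN_RECONNECT_PENDING;
        do_reconnect = true;
    }
    pthread_mutex_unlock(&conn->lock);

    return do_reconnect;
}

int main(void)
{
    rpc_conn_t conn = { PTHREAD_MUTEX_INITIALIZER, CONN_CONNECTED, 24007 };

    if (rpc_conn_reconfig_and_disconnect(&conn, 49152))
        printf("disconnecting to switch to port %d\n", conn.port);

    /* A reconnect attempt racing with the disconnect is rejected. */
    printf("reconnect allowed: %d\n", rpc_conn_try_reconnect(&conn));
    return 0;
}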
