[Bugs] [Bug 1716979] Multiple disconnect events being propagated for the same child

Tue Jun 4 13:52:16 UTC 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1716979

Raghavendra G <rgowdapp at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #2 from Raghavendra G <rgowdapp at redhat.com> ---
The multiple disconnect events are due to repeated connect/disconnect cycles
with glusterd (port 24007). rpc/clnt has a reconnect feature which tries to
reconnect to a disconnected brick, and a client connection to a brick is a
two-step process:
1. connect to glusterd, get brick port then disconnect
2. connect to brick

In this case step 1 would be successful and step 2 won't happen, as glusterd
wouldn't send back the brick port (the brick is dead). Nevertheless there is a
chain of connects/disconnects (to glusterd) at the rpc layer, and these are
valid steps as we need the reconnection logic. However, subsequent disconnect
events were prevented from reaching the parents of protocol/client, as it
remembered the last event it sent and skipped the notification if the current
event was the same as the last one. Before the Halo replication feature -
https://review.gluster.org/16177 - last_sent_event for this test case would be
GF_EVENT_DISCONNECT and hence subsequent disconnects were not notified to
parent xlators. But Halo replication introduced another event,
GF_EVENT_CHILD_PING, which gets notified to parents of protocol/client whenever
there is a successful ping response. In this case, the successful ping response
would be from glusterd and would change conf->last_sent_event to
GF_EVENT_CHILD_PING. As a result, subsequent disconnect events are no longer
skipped.
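
To make the interaction concrete, here is a minimal standalone sketch of the
"skip if same as last sent event" check and how an interleaved ping event
defeats it. This is not the protocol/client code: the enum values, struct
client_conf, notify_parent() and simulate_disconnects() are hypothetical names
used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the events named above; values are arbitrary. */
enum event { EV_NONE, EV_DISCONNECT, EV_CHILD_PING };

/* Hypothetical, stripped-down view of the per-client state. */
struct client_conf {
    enum event last_sent_event;   /* last event propagated to parents */
    int disconnects_propagated;   /* disconnects that reached parents */
};

/* Propagate ev to parents unless it repeats the last propagated event. */
static bool notify_parent(struct client_conf *conf, enum event ev)
{
    if (conf->last_sent_event == ev)
        return false;             /* duplicate, skipped */
    conf->last_sent_event = ev;
    if (ev == EV_DISCONNECT)
        conf->disconnects_propagated++;
    return true;
}

/*
 * Replay `cycles` connect/disconnect rounds against glusterd for a dead
 * brick. With ping_events set, each round first propagates a ping event
 * (glusterd answered the ping) before the disconnect.
 */
int simulate_disconnects(bool ping_events, int cycles)
{
    struct client_conf conf = { EV_NONE, 0 };

    for (int i = 0; i < cycles; i++) {
        if (ping_events)
            notify_parent(&conf, EV_CHILD_PING);
        notify_parent(&conf, EV_DISCONNECT);
    }
    return conf.disconnects_propagated;
}
```

In this sketch, simulate_disconnects(false, 3) propagates one disconnect, while
simulate_disconnects(true, 3) propagates three, because each ping resets
last_sent_event and the following disconnect no longer looks like a duplicate.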

A patch to propagate GF_EVENT_CHILD_PING only after a successful handshake
prevents spurious CHILD_DOWN events from reaching afr. However, I am not sure
whether this breaks Halo replication. I would request afr team members to
comment on the patch (I'll post it shortly).
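
A rough standalone sketch of the idea, not the patch itself: ping events are
held back until a handshake with the brick has completed, so ping replies from
glusterd cannot reset the last sent event. The handshake_done flag and all
function names here are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the events named above; values are arbitrary. */
enum event { EV_NONE, EV_DISCONNECT, EV_CHILD_PING };

/* Hypothetical per-client state; handshake_done is an assumed flag. */
struct client_conf {
    enum event last_sent_event;   /* last event propagated to parents */
    bool handshake_done;          /* true once the brick handshake succeeds */
    int disconnects_propagated;   /* disconnects that reached parents */
};

/* Propagate ev to parents unless it repeats the last propagated event. */
static void notify_parent(struct client_conf *conf, enum event ev)
{
    if (conf->last_sent_event == ev)
        return;
    conf->last_sent_event = ev;
    if (ev == EV_DISCONNECT)
        conf->disconnects_propagated++;
}

/* Ping replies are propagated only for a handshaken (brick) connection. */
static void on_ping_reply(struct client_conf *conf)
{
    if (!conf->handshake_done)
        return;                   /* reply came from glusterd, ignore */
    notify_parent(conf, EV_CHILD_PING);
}

static void on_disconnect(struct client_conf *conf)
{
    conf->handshake_done = false;
    notify_parent(conf, EV_DISCONNECT);
}

/* Dead brick: every round only reaches glusterd, never the handshake. */
int simulate_dead_brick(int cycles)
{
    struct client_conf conf = { EV_NONE, false, 0 };

    for (int i = 0; i < cycles; i++) {
        on_ping_reply(&conf);
        on_disconnect(&conf);
    }
    return conf.disconnects_propagated;
}
```

With the gate in place, simulate_dead_brick(3) propagates a single disconnect
in this sketch. Whether Halo replication relies on ping events that arrive
before the handshake is the open question raised above.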
