[Bugs] [Bug 1631128] rpc marks brick disconnected from glusterd & volume stop transaction gets timed out

Thu Sep 20 03:28:16 UTC 2018

https://bugzilla.redhat.com/show_bug.cgi?id=1631128

Atin Mukherjee <amukherj at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |POST

--- Comment #2 from Atin Mukherjee <amukherj at redhat.com> ---
RCA:

Turned out to be a brick multiplexing bug in glusterd.
When a volume is stopped, the individual bricks are detached from the brick
process during the brick-op phase of the volume stop transaction. The counter
called "blockers" is incremented before each rpc_submit and decremented in the
brick-op callback inside a sync_lock () / sync_unlock () section. The purpose
of this counter is to ensure that it is 0 before the commit phase of a
transaction is originated.
Because heketi distributes the volume stop commands across different cli &
glusterd instances, these volume stop requests are processed in parallel, and
the additional context switches through sync_lock and sync_unlock for updating
the blockers counter increase the overall wait time. Since RHGS 3.3.1, if a
regular transaction takes more than 180 seconds, the lock timer is forcibly
expired. Unfortunately, in this situation some of the transactions are kicked
out while they are still in progress. This is why we see unlock timer callback
log entries in the glusterd log. I believe this also had a ripple effect on
the saved frames, which is why we see critical log entries like:

[2018-08-28 21:23:49.759511] C [rpc-clnt.c:449:rpc_clnt_fill_request_info]
0-management: cannot lookup the saved frame corresponding to xid (2170)

and this causes a disconnect of the original brick, given that the rpc
connections of all the attached bricks are copies of the parent brick
connection.

Updating the blockers counter under synclock is overkill. Updating it through
the GF_ATOMIC API should be sufficient for the atomicity guarantee.
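
To illustrate the idea, here is a small standalone sketch. It is not the
glusterd code or the upstream patch: C11 <stdatomic.h> stands in for the
GF_ATOMIC API, plain pthreads stand in for the parallel transactions, and the
names brick_op_worker and run_parallel_brick_ops are made up for this example.

```c
/* Illustrative sketch only. C11 atomics are used as a stand-in for the
 * GF_ATOMIC API; every name below is hypothetical, not taken from glusterd. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Number of brick ops currently in flight. */
static atomic_int blockers;

struct worker_args {
    int ops; /* how many brick ops this worker simulates */
};

static void *
brick_op_worker(void *arg)
{
    const struct worker_args *wa = arg;

    for (int i = 0; i < wa->ops; i++) {
        /* Before submitting the rpc: one more op in flight. */
        atomic_fetch_add(&blockers, 1);

        /* ... the brick op would be in flight here ... */

        /* In the brick-op callback: the op has completed. No lock is
         * taken, so no extra context switch is forced on the caller. */
        atomic_fetch_sub(&blockers, 1);
    }
    return NULL;
}

/* Runs nthreads workers in parallel and returns the final counter value. */
int
run_parallel_brick_ops(int nthreads, int ops_per_thread)
{
    struct worker_args wa = { ops_per_thread };
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)nthreads);

    assert(tids != NULL);
    atomic_store(&blockers, 0);

    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, brick_op_worker, &wa);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    free(tids);

    /* The commit phase may only start once this reads 0. */
    return atomic_load(&blockers);
}
```

Once all workers have been joined, run_parallel_brick_ops (8, 10000) returns 0,
because every increment is paired with a decrement and neither update can be
lost, even though no lock is held around the counter.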

upstream patch : https://review.gluster.org/#/c/glusterfs/+/21221/
