[Bugs] [Bug 1455907] heal info shows the status of the bricks as "Transport endpoint is not connected" though bricks are up

bugzilla at redhat.com bugzilla at redhat.com
Fri May 26 12:22:02 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1455907



--- Comment #1 from Atin Mukherjee <amukherj at redhat.com> ---
Description of problem:
=======================
The heal info command shows the status of the bricks as "Transport endpoint
is not connected" even though the bricks are up and running.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
=================
always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and enable brick multiplexing.
2) Start the volume and FUSE-mount it on a client.
3) Set cluster.self-heal-daemon to off.
4) Create 10 directories on the mount point.
5) Kill one brick of one of the replica sets in the volume and modify the
permissions of all the directories.
6) Start the volume with the force option.
7) Kill the other brick in the same replica set and modify the permissions of
the directories again.
8) Start the volume with the force option and examine the output of the
`gluster volume heal <vol-name> info' command on the server. (A sample CLI
transcript follows this list.)
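
For concreteness, the steps above might look like the following transcript (a
sketch only: the volume name, server and brick paths, mount point, and PIDs
are hypothetical placeholders; note that cluster.brick-multiplex is a
cluster-wide option, hence `set all'):

    # 1) Create a 2x2 distributed-replicate volume and enable brick
    #    multiplexing (a cluster-wide option)
    gluster volume create testvol replica 2 \
        server1:/bricks/b1 server2:/bricks/b2 \
        server3:/bricks/b3 server4:/bricks/b4
    gluster volume set all cluster.brick-multiplex on

    # 2) Start the volume and FUSE-mount it on a client
    gluster volume start testvol
    mount -t glusterfs server1:/testvol /mnt/testvol

    # 3) Disable the self-heal daemon so no heal kicks in automatically
    gluster volume set testvol cluster.self-heal-daemon off

    # 4) Create 10 directories on the mount point
    mkdir /mnt/testvol/dir{1..10}

    # 5) Kill one brick of a replica set (PID taken from
    #    `gluster volume status testvol') and modify the permissions
    kill -9 <brick-pid>
    chmod 700 /mnt/testvol/dir*

    # 6) Restart the killed brick
    gluster volume start testvol force

    # 7) Kill the other brick of the same replica set and modify the
    #    permissions again
    kill -9 <other-brick-pid>
    chmod 755 /mnt/testvol/dir*

    # 8) Restart the brick and examine heal info
    gluster volume start testvol force
    gluster volume heal testvol info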

Actual results:
===============
The heal info command shows the status of the bricks as "Transport endpoint
is not connected" even though the bricks are up and running.


RCA:

When we stop the volume, GlusterD actually sends two terminate requests to the
brick process: one during the brick-op phase and another during the commit
phase. Without multiplexing this caused no problem, because the process was
going to stop anyway. But with multiplexing each terminate is just a detach,
so the detach is executed twice. Those two requests can be processed at the
same time; if that happens, we may delete the graph entry twice, because no
lock is taken around the link modification of the graph in
glusterfs_handle_detach.

As a result, the linked-list pointers are moved twice, which ends up deleting
the graph entry of an independent brick.
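
To make the race concrete, here is a minimal, self-contained C sketch of the
idea behind the fix (simplified, hypothetical types and names; the real code
walks glusterfsd's graph list in glusterfs_handle_detach). Two detach requests
for the same brick run concurrently; the mutex serializes them so the second
request simply finds nothing left to unlink:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Simplified stand-in for the per-process graph list that a
     * multiplexed brick process maintains (one entry per attached brick). */
    struct graph_entry {
        char name[64];
        struct graph_entry *next;
    };

    static struct graph_entry *graph_list;
    static pthread_mutex_t graph_lock = PTHREAD_MUTEX_INITIALIZER;

    static void attach(const char *brick)
    {
        struct graph_entry *e = calloc(1, sizeof(*e));
        snprintf(e->name, sizeof(e->name), "%s", brick);
        pthread_mutex_lock(&graph_lock);
        e->next = graph_list;
        graph_list = e;
        pthread_mutex_unlock(&graph_lock);
    }

    /* Detach a brick by unlinking its graph entry.  Without the lock the
     * two terminate requests race: both threads may find the same entry,
     * free it twice, or rewire the links while the other thread is still
     * walking them -- which is how an independent brick's entry can get
     * unlinked, as described in the RCA above. */
    static void handle_detach(const char *brick)
    {
        pthread_mutex_lock(&graph_lock);   /* the fix: serialize requests */
        for (struct graph_entry **pp = &graph_list; *pp; pp = &(*pp)->next) {
            if (strcmp((*pp)->name, brick) == 0) {
                struct graph_entry *victim = *pp;
                *pp = victim->next;        /* unlink exactly once */
                free(victim);
                break;
            }
        }
        pthread_mutex_unlock(&graph_lock);
    }

    static void *detach_thread(void *arg)
    {
        handle_detach(arg);   /* GlusterD sends the terminate request twice */
        return NULL;
    }

    int main(void)
    {
        attach("brick-a");
        attach("brick-b");    /* the independent brick that must survive */

        /* The two terminate requests for the same brick, processed in
         * parallel, as in the RCA. */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, detach_thread, "brick-a");
        pthread_create(&t2, NULL, detach_thread, "brick-a");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (struct graph_entry *e = graph_list; e; e = e->next)
            printf("still attached: %s\n", e->name);
        return 0;
    }

With the two lock calls removed, the same program can double-free the entry or
corrupt the list; with them in place, `brick-b' always survives.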

--- Additional comment from Worker Ant on 2017-05-23 11:31:50 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach
request inside lock) posted (#1) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-05-23 11:32:29 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach
request inside lock) posted (#2) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-05-24 03:48:04 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach
request inside lock) posted (#3) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-05-24 10:50:37 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach
request inside lock) posted (#4) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-05-25 08:32:46 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach
request inside lock) posted (#5) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-05-26 08:11:32 EDT ---

COMMIT: https://review.gluster.org/17374 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit 3ca5ae2f3bff2371042b607b8e8a218bf316b48c
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Fri May 19 21:04:53 2017 +0530

    glusterfsd: process attach and detach request inside lock

    With brick multiplexing, there is a high possibility that attach and
    detach requests might be processed in parallel, and to avoid concurrent
    updates to the same graph list, a mutex lock is required.

    Credits : Rafi (rkavunga at redhat.com) for the RCA of this issue

    Change-Id: Ic8e6d1708655c8a143c5a3690968dfa572a32a9c
    BUG: 1454865
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: https://review.gluster.org/17374
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Jeff Darcy <jeff at pl.atyp.us>
