[Bugs] [Bug 1436595] New: Brick Multiplexing: A brick going down will result in all the bricks sharing the same PID going down

bugzilla at redhat.com bugzilla at redhat.com
Tue Mar 28 09:19:38 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1436595

            Bug ID: 1436595
           Summary: Brick Multiplexing: A brick going down will result in
                     all the bricks sharing the same PID going down
           Product: GlusterFS
           Version: 3.10
         Component: core
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: nchilaka at redhat.com
                CC: bugs at gluster.org



Description of problem:
=======================
I unmounted one of the LVs hosting a gluster brick, and this resulted in all
bricks on that node going offline.
This is a very serious issue, given that a brick can go offline for various
reasons (say an xfs corruption, a disk failure, etc.), but that failure should
be isolated to the affected brick instead of bringing down all the other
bricks. Note that I am NOT killing the brick PID.
I had a 3-node setup, with each node having 4 thin LVs used to host gluster
bricks; say the LVs are mounted on /rhs/brick{1..4}.
Brick multiplexing is enabled.
I created 3 volumes as below:
v1 -> 1x2 -> n1:b1 n2:b1
v2 -> 2x2 -> n1:b2 n2:b2 n1:b3 n2:b3
v3 -> 1x3 -> n1:b4 n2:b4 n3:b4
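
The setup roughly corresponds to the sketch below (the hostnames n1/n2/n3, the
volume names v1/v2/v3 and the /rhs/brick{1..4} paths are just the placeholders
from the layout above; the exact commands used were not captured):

# enable brick multiplexing cluster-wide (new in 3.10)
gluster volume set all cluster.brick-multiplex on

# v1: 1x2 replicate
gluster volume create v1 replica 2 n1:/rhs/brick1/v1 n2:/rhs/brick1/v1
# v2: 2x2 distributed-replicate
gluster volume create v2 replica 2 n1:/rhs/brick2/v2 n2:/rhs/brick2/v2 \
    n1:/rhs/brick3/v2 n2:/rhs/brick3/v2
# v3: 1x3 replicate
gluster volume create v3 replica 3 n1:/rhs/brick4/v3 n2:/rhs/brick4/v3 n3:/rhs/brick4/v3

gluster volume start v1
gluster volume start v2
gluster volume start v3

With multiplexing on, "gluster v status" should then show all bricks of a node
sharing a single PID.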

Now I unmounted b4 (with "umount -l") so as to bring only one brick of v3
offline.
This resulted in all the bricks on node1 going offline, as shown below:


[root@dhcp35-192 bricks]# gluster v status
Status of volume: distrep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.192:/rhs/brick2/distrep      N/A       N/A        N       N/A  
Brick 10.70.35.214:/rhs/brick2/distrep      49154     0          Y       20321
Brick 10.70.35.192:/rhs/brick3/distrep      N/A       N/A        N       N/A  
Brick 10.70.35.215:/rhs/brick3/distrep      49154     0          Y       13393
Self-heal Daemon on localhost               N/A       N/A        Y       6007 
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       20583
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       13643

Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: spencer
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.192:/rhs/brick4/spencer      N/A       N/A        N       N/A  
Brick 10.70.35.214:/rhs/brick4/spencer      49154     0          Y       20321
Brick 10.70.35.215:/rhs/brick4/spencer      49154     0          Y       13393
Self-heal Daemon on localhost               N/A       N/A        Y       6007 
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       20583
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       13643

Task Status of Volume spencer
------------------------------------------------------------------------------
There are no active volume tasks


Note that I did a umount of brick3 ("umount -l /rhs/brick3"), which was being
used by the distrep volume for its second dht-subvol.

Version-Release number of selected component (if applicable):
[root@dhcp35-192 bricks]# rpm -qa|grep glust
glusterfs-fuse-3.10.0-1.el7.x86_64
glusterfs-rdma-3.10.0-1.el7.x86_64
glusterfs-libs-3.10.0-1.el7.x86_64
glusterfs-client-xlators-3.10.0-1.el7.x86_64
glusterfs-api-3.10.0-1.el7.x86_64
glusterfs-server-3.10.0-1.el7.x86_64
glusterfs-debuginfo-3.10.0-1.el7.x86_64
glusterfs-3.10.0-1.el7.x86_64
glusterfs-cli-3.10.0-1.el7.x86_64


How reproducible:



Steps to Reproduce:
1. Have a setup of 2 or more nodes with multiple disks (or, say, LVs) to use
as bricks.
2. Create 2 or more volumes of any type such that each node hosts at least one
brick of each volume. Make sure that no two bricks are hosted on the same path
(i.e., the same LV or physical device).
3. Now bring down one brick by taking its disk down or unmounting its LV (see
the sketch below).
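
For step 3, a minimal sketch of what was done here (the paths are examples
matching the setup described above):

# with multiplexing on, all bricks on a node report the same PID
gluster volume status

# take away the backing filesystem of a single brick, e.g. by lazily
# unmounting the LV mounted at /rhs/brick3
umount -l /rhs/brick3

# with this bug, every brick on that node now shows Online = N,
# not just the brick hosted under /rhs/brick3
gluster volume status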


Actual results:
=======
All bricks on that node that are associated with the same PID go down.

Expected results:
===========
A single brick going down should not result in all the other bricks going down.
