[Bugs] [Bug 1703343] Bricks fail to come online after node reboot on a scaled setup

bugzilla at redhat.com bugzilla at redhat.com
Fri Apr 26 07:48:16 UTC 2019


--- Comment #1 from Mohit Agrawal <moagrawa at redhat.com> ---
Multiple brick processes are spawned on a node if that node is rebooted while
volumes are being started from another node in the cluster.

Reproducer steps
1) Set up a cluster of 3 nodes
2) Enable brick_mux, then create and start 50 volumes from node 1
3) Stop all the volumes from any node
4) Start all the volumes from node 2 with a 1-second delay between starts
   for i in {1..50}; do gluster v start testvol$i --mode=script; sleep 1; done
5) While the volumes are starting on node 2, run this command on node 1
   pkill -f gluster; glusterd
6) Wait for the volume startups to finish, then check the number of glusterfsd
   processes running on node 1.
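For convenience, the steps above can be sketched as shell snippets per node (the testvol names and counts follow the steps; this assumes a live 3-node Gluster cluster, so it is an operational sketch rather than something runnable in isolation):

```shell
# On node 2 (step 4): start all 50 volumes with a 1-second gap between starts.
for i in $(seq 1 50); do
    gluster volume start "testvol$i" --mode=script
    sleep 1
done

# On node 1, while the loop above is running (step 5): kill all gluster
# processes and restart glusterd.
#   pkill -f gluster; glusterd

# On node 1, after the starts settle (step 6): count running brick processes.
# With brick multiplexing enabled, there should normally be far fewer
# glusterfsd processes than volumes; the bug shows up as many extra ones.
#   pgrep -c glusterfsd
```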

RCA: When glusterd starts, it receives a friend update request from a peer
     node carrying version changes for the volumes that were started while
     the node was down. glusterd deletes the volfile and the reference to the
     old version from its internal data structures and creates a new volfile.
     glusterd attempted to attach the volume, but because these data structure
     changes happened after the brick start, the data sent in the attach RPC
     request was not correct. The brick process then sends a disconnect to
     glusterd, glusterd tries to spawn the brick again, and as a result
     multiple brick processes are spawned.

Mohit Agrawal
