[Bugs] [Bug 1004546] peer probe can deadlock in "Sent and Received peer request" for both servers after server build
bugzilla at redhat.com
Mon Jan 30 06:11:29 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1004546
Atin Mukherjee <amukherj at redhat.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amukherj at redhat.com,
                   |                            |todd+rhbugs at stansell.org
              Flags|                            |needinfo?(todd+rhbugs at stansell.org)
--- Comment #8 from Atin Mukherjee <amukherj at redhat.com> ---
(In reply to Todd Stansell from comment #0)
> Description of problem:
>
> Occasionally after rebuilding a node in a replica 2 cluster, the initial
> peer probe from the rebuilt node will cause both peers to be in a "Sent and
> Received peer request" state, never exchanging volume information or
> letting the rebuilt node move to the "Peer in Cluster" state.
>
> The only way out of this I've found is to stop glusterd on both nodes,
> remove the state= parameter from the /var/lib/glusterd/peers/<uuid> file and
> then start glusterd up again. After starting glusterd, the negotiation
> between the two starts from the "Establishing Connection" state and things
> work as expected.
>
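For reference, the workaround described above amounts to roughly the following
on each node (the sed invocation is only an illustration of "remove the state=
parameter"; editing the /var/lib/glusterd/peers/<uuid> file by hand works just
as well):

# stop glusterd, drop the state= line from the peer file(s), then start
# glusterd again so the handshake restarts from "Establishing Connection"
service glusterd stop
sed -i '/^state=/d' /var/lib/glusterd/peers/*
service glusterd start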
> Version-Release number of selected component (if applicable):
> 3.4.0
>
> How reproducible:
> It seems to happen every time I change which host is being rebuilt, but not
> if I rebuild the same node. I'm not 100% sure of this pattern, but it seems
> this way.
>
> Steps to Reproduce:
> 1. begin with a replica 2 cluster
> 2. shut down services and kickstart one server
> 3. restore previous uuid in /var/lib/glusterd/glusterd.info
Why are we trying to restore the previous UUID? If it's a fresh setup, then
you should retain the original UUIDs.
> 4. start glusterd
> 5. run: gluster peer probe $peer
> 6. restart glusterd
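Expanding on steps 3-6, this is roughly what the sequence looks like as
commands on the rebuilt node, assuming the old glusterd.info was saved off
somewhere before the kickstart (the /root path is purely hypothetical; the
reporter's kickstart preserves the uuid its own way):

# step 3: restore the pre-rebuild UUID before glusterd starts for the first time
cp /root/glusterd.info.saved /var/lib/glusterd/glusterd.info
# steps 4-6: start glusterd, probe the surviving peer, then restart glusterd
service glusterd start
gluster peer probe admin01.mgmt
service glusterd restart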
>
> Actual results:
>
> peer status shows both peers in "Sent and Received peer request" as they
> both seem to wait for an ACC from the other side.
>
> Expected results:
>
> peer should end up in "Peer in Cluster" state, with volume information
> exchanged and bricks started.
>
> Additional info:
>
> In our situation, we've written kickstart scripts to automate the peer probe
> and rejoining of the cluster. During kickstart we preserve the uuid of the
> server (step 3) and then set up an init script to run soon after glusterd
> starts upon the first boot. The script that gets generated in our test
> while rebuilding admin02.mgmt is as follows (to see our exact steps):
>
> #!/bin/bash
> # initialize glusterfs config
> #
>
> # Source function library.
> . /etc/init.d/functions
>
> me=admin02.mgmt
> peer=admin01.mgmt
> gluster peer probe $peer
> for i in 1 2 3 4 5; do
>     echo -n "Checking for Peer in Cluster .. $i .. "
>     out=`gluster peer status 2>/dev/null | grep State:`
>     echo $out
>     if echo $out | grep "Peer in Cluster" >/dev/null; then
>         break
>     fi
>     sleep 1
> done
> # restart glusterd after we've attempted a probe
> service glusterd restart
>
> for i in 1 2 3 4 5; do
>     echo "Checking for volume info .. $i"
>     out=`gluster volume info 2>/dev/null | grep -v "^No "`
>     if [ -n "$out" ] ; then
>         break
>     fi
>     sleep 1
> done
> #----------------------------------------------------
>
> One of the failures we've observed showed the following on the console:
>
> Running /etc/rc3.d/S21glusterfs-init start
> peer probe: success
> Checking for Peer in Cluster .. 1 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 2 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 3 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 4 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 5 .. State: Accepted peer request (Connected)
> Stopping glusterd:[ OK ]
> Starting glusterd:[ OK ]
> Checking for volume info .. 1
> Checking for volume info .. 2
> Checking for volume info .. 3
> Checking for volume info .. 4
> Checking for volume info .. 5
>
> After this, if we look at peer status, it shows both nodes in the "Sent and
> Received peer request" status.
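A quick way to confirm this stuck state on disk is to compare what each node
has recorded for the other; this is just a sketch based on the state= field
mentioned in the workaround above, run on both nodes:

# both sides report the same intermediate state and never progress
gluster peer status | grep State:
grep state= /var/lib/glusterd/peers/*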
>
> On one of the runs where this procedure worked, we got the following output:
>
> Running /etc/rc3.d/S21glusterfs-init start
> peer probe: success
> Checking for Peer in Cluster .. 1 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 2 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 3 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 4 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 5 .. State: Accepted peer request (Connected)
> Stopping glusterd:[ OK ]
> Starting glusterd:[ OK ]
> Checking for volume info .. 1
> Checking for volume info .. 2
>
> And at this point, it joins the cluster and starts the bricks.
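In the successful case, one way to verify that the node really rejoined and
the bricks came up would be something like the following (standard gluster
CLI commands, not part of the reporter's script):

gluster peer status            # should show "Peer in Cluster (Connected)"
gluster volume info            # volume definitions are now present
gluster volume status          # bricks on the rebuilt node listed as online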
>
> I will include attachments of the etc-glusterfs-glusterd logs in DEBUG mode
> from both servers in three different situations to help show what's going
> on.
>
> * The logs with -194603 suffix are from the failed attempt above to
> kickstart admin02.
> * The logs with -200414 are after I shut down glusterd on both nodes and
> removed state= from the peer files, causing them to start over and join the
> cluster.
> * The logs with -215614 are a second full kickstart of admin02 where it
> succeeded as expected.
>
> The only pattern I can find is that it seems to fail every time I switch
> which node is getting kickstarted. If I continue to kickstart the same
> node, it continues to succeed.
>
> Todd
--
You are receiving this mail because:
You are on the CC list for the bug.