[Bugs] [Bug 1004546] peer probe can deadlock in "Sent and Received peer request" for both servers after server build
bugzilla at redhat.com
Mon Jan 30 06:11:29 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1004546
Atin Mukherjee <amukherj at redhat.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amukherj at redhat.com,
                   |                            |todd+rhbugs at stansell.org
              Flags|                            |needinfo?(todd+rhbugs at stansell.org)
--- Comment #8 from Atin Mukherjee <amukherj at redhat.com> ---
(In reply to Todd Stansell from comment #0)
> Description of problem:
>
> Occasionally after rebuilding a node in a replica 2 cluster, the initial
> peer probe from the rebuilt node will cause both peers to be in a "Sent and
> Received peer request" state, never exchanging volume information or
> letting the rebuilt node move to the "Peer in Cluster" state.
>
> The only way out of this I've found is to stop glusterd on both nodes,
> remove the state= parameter from the /var/lib/glusterd/peers/<uuid> file and
> then start glusterd up again. After starting glusterd, the negotiation
> between the two starts from the "Establishing Connection" state and things
> work as expected.
>
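For reference, the workaround described above amounts to roughly the following
on each node (the sed invocation is only an illustration of "remove the state=
parameter"; editing the /var/lib/glusterd/peers/<uuid> file by hand works just
as well):

# stop glusterd, drop the state= line from the peer file(s), then start
# glusterd again so the handshake restarts from "Establishing Connection"
service glusterd stop
sed -i '/^state=/d' /var/lib/glusterd/peers/*
service glusterd start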
> Version-Release number of selected component (if applicable):
> 3.4.0
>
> How reproducible:
> It seems to happen every time I change which host is being rebuilt, but not
> if I rebuild the same node. I'm not 100% sure of this pattern, but it seems
> this way.
>
> Steps to Reproduce:
> 1. begin with a replica 2 cluster
> 2. shut down services and kickstart one server
> 3. restore previous uuid in /var/lib/glusterd/glusterd.info
Why are we trying to restore the previous UUID? If it's a fresh setup, then
you should retain the original UUIDs.
> 4. start glusterd
> 5. run: gluster peer probe $peer
> 6. restart glusterd
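Expanding on steps 3-6, this is roughly what the sequence looks like as
commands on the rebuilt node, assuming the old glusterd.info was saved off
somewhere before the kickstart (the /root path is purely hypothetical; the
reporter's kickstart preserves the uuid its own way):

# step 3: restore the pre-rebuild UUID before glusterd starts for the first time
cp /root/glusterd.info.saved /var/lib/glusterd/glusterd.info
# steps 4-6: start glusterd, probe the surviving peer, then restart glusterd
service glusterd start
gluster peer probe admin01.mgmt
service glusterd restart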
>
> Actual results:
>
> peer status shows both peers in "Sent and Received peer request" as they
> both seem to wait for an ACC from the other side.
>
> Expected results:
>
> peer should end up in "Peer in Cluster" state, with volume information
> exchanged and bricks started.
>
> Additional info:
>
> In our situation, we've written kickstart scripts to automate the peer probe
> and rejoining of the cluster. During kickstart we preserve the uuid of the
> server (step 3) and then set up an init script to run soon after glusterd
> starts upon the first boot. The script that gets generated in our test
> while rebuilding admin02.mgmt is as follows (to see our exact steps):
>
> #!/bin/bash
> # initialize glusterfs config
> #
>
> # Source function library.
> . /etc/init.d/functions
>
> me=admin02.mgmt
> peer=admin01.mgmt
> gluster peer probe $peer
> for i in 1 2 3 4 5; do
>     echo -n "Checking for Peer in Cluster .. $i .. "
>     out=`gluster peer status 2>/dev/null | grep State:`
>     echo $out
>     if echo $out | grep "Peer in Cluster" >/dev/null; then
>         break
>     fi
>     sleep 1
> done
> # restart glusterd after we've attempted a probe
> service glusterd restart
>
> for i in 1 2 3 4 5; do
>     echo "Checking for volume info .. $i"
>     out=`gluster volume info 2>/dev/null | grep -v "^No "`
>     if [ -n "$out" ] ; then
>         break
>     fi
>     sleep 1
> done
> #----------------------------------------------------
>
> One of the failures we've observed showed the following on the console:
>
> Running /etc/rc3.d/S21glusterfs-init start
> peer probe: success
> Checking for Peer in Cluster .. 1 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 2 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 3 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 4 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 5 .. State: Accepted peer request (Connected)
> Stopping glusterd:[ OK ]
> Starting glusterd:[ OK ]
> Checking for volume info .. 1
> Checking for volume info .. 2
> Checking for volume info .. 3
> Checking for volume info .. 4
> Checking for volume info .. 5
>
> After this, if we look at peer status, it shows both nodes in the "Sent and
> Received peer request" status.
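A quick way to confirm this stuck state on disk is to compare what each node
has recorded for the other; this is just a sketch based on the state= field
mentioned in the workaround above, run on both nodes:

# both sides report the same intermediate state and never progress
gluster peer status | grep State:
grep state= /var/lib/glusterd/peers/*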
>
> On one of the runs where this procedure worked, we got the following output:
>
> Running /etc/rc3.d/S21glusterfs-init start
> peer probe: success
> Checking for Peer in Cluster .. 1 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 2 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 3 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 4 .. State: Accepted peer request (Connected)
> Checking for Peer in Cluster .. 5 .. State: Accepted peer request (Connected)
> Stopping glusterd:[ OK ]
> Starting glusterd:[ OK ]
> Checking for volume info .. 1
> Checking for volume info .. 2
>
> And at this point, it joins the cluster and starts the bricks.
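In the successful case, one way to verify that the node really rejoined and
the bricks came up would be something like the following (standard gluster
CLI commands, not part of the reporter's script):

gluster peer status            # should show "Peer in Cluster (Connected)"
gluster volume info            # volume definitions are now present
gluster volume status          # bricks on the rebuilt node listed as online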
>
> I will include attachments of the etc-glusterfs-glusterd logs in DEBUG mode
> from both servers in three different situations to help show what's going
> on.
>
> * The logs with -194603 suffix are from the failed attempt above to
> kickstart admin02.
> * The logs with -200414 are after I shut down glusterd on both nodes and
> removed state= from the peer files, causing them to start over and join the
> cluster.
> * The logs with -215614 are a second full kickstart of admin02 where it
> succeeded as expected.
>
> The only pattern I can find is that it seems to fail every time I switch
> which node is getting kickstarted. If I continue to kickstart the same
> node, it continues to succeed.
>
> Todd
--
You are receiving this mail because:
You are on the CC list for the bug.