<div dir="ltr"><div><div>Hey Atin,<br><br></div>This is happening because glusterd on the third node was brought down before doing the replace brick.<br></div><div>During replace brick we do a temporary mount to mark a pending xattr on the source bricks, saying that the brick being replaced is the sink.<br></div><div>But in this case, since glusterd on one of the source bricks&#39; nodes is down, when the mount tries to get the port on which that brick is listening,<br>it fails to get it, leading to failure of setting the &quot;trusted.replace-brick&quot; attribute.<br>For a replica 3 volume to report any fop as successful, it needs at least a quorum number of successes. Hence the replace brick fails.<br><br></div><div>On the QE setup the replace brick would have succeeded only because of some race between glusterd going down and the replace brick happening.<br></div><div>Otherwise there is no chance for the replace brick to succeed.<br><br></div><div>Regards,<br></div><div>Karthik<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 27, 2018 at 7:25 PM, Atin Mukherjee <span dir="ltr">&lt;<a href="mailto:amukherj@redhat.com" target="_blank">amukherj@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>While writing a test for the patch fix of BZ <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1560957" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1560957</a>, I just can&#39;t make my test case pass: a replace brick commit force always fails on a multi-node cluster, and that&#39;s on the latest mainline code.<br><br></div><div><b>The fix is a one-liner:<br></b><br>atin@dhcp35-96:~/codebase/upstream/glusterfs_master/glusterfs$ gd HEAD~1<br>diff --git a/xlators/mgmt/glusterd/src/glusterd-utils.c b/xlators/mgmt/glusterd/src/glusterd-utils.c<br>index af30756c9..24d813fbd 100644<br>--- 
a/xlators/mgmt/glusterd/src/glusterd-utils.c<br>+++ b/xlators/mgmt/glusterd/src/glusterd-utils.c<br>@@ -5995,6 +5995,7 @@ glusterd_brick_start (glusterd_volinfo_t *volinfo,<br>                          * TBD: re-use RPC connection across bricks<br>                          */<br>                         if (is_brick_mx_enabled ()) {<br>+                                brickinfo-&gt;port_registered = _gf_true;<br>                                 ret = glusterd_get_sock_from_brick_pid (pid, socketpath,<br>                                                                    sizeof(socketpath));<br>                                 if (ret) {<br><br></div><div><br></div><br><br><b>The test does the following:</b><br><br>#!/bin/bash<br><br>. $(dirname $0)/../../include.rc<br>. $(dirname $0)/../../cluster.rc<br>. 
$(dirname $0)/../../volume.rc<br><br>cleanup;<br><br>TEST launch_cluster 3;<br><br>TEST $CLI_1 peer probe $H2;<br>EXPECT_WITHIN $PROBE_TIMEOUT 1 peer_count<br><br>TEST $CLI_1 peer probe $H3;<br>EXPECT_WITHIN $PROBE_TIMEOUT 2 peer_count<br><br>TEST $CLI_1 volume set all cluster.brick-multiplex on<br><br>TEST $CLI_1 volume create $V0 replica 3 $H1:$B1/${V0}1 $H2:$B2/${V0}1 $H3:$B3/${V0}1<br><br>TEST $CLI_1 volume start $V0<br>EXPECT_WITHIN $PROCESS_UP_TIMEOUT &quot;1&quot; brick_up_status_1 $V0 $H1 $B1/${V0}1<br>EXPECT_WITHIN $PROCESS_UP_TIMEOUT &quot;1&quot; brick_up_status_1 $V0 $H2 $B2/${V0}1<br>EXPECT_WITHIN $PROCESS_UP_TIMEOUT &quot;1&quot; brick_up_status_1 $V0 $H3 $B3/${V0}1<br>
<br><br>#bug-1560957 - replace brick followed by an add-brick in a brick mux setup<br>#brings down one brick instance<br><br>kill_glusterd 3<br>EXPECT_WITHIN $PROBE_TIMEOUT 1 peer_count<br>TEST $CLI_1 volume replace-brick $V0 $H1:$B1/${V0}1 $H1:$B1/${V0}1_new commit force<br><br><b>This is where the test always fails, saying &quot;volume replace-brick: failed: Commit failed on localhost. Please check log file for details.&quot;<br></b><br>TEST $glusterd_3<br>EXPECT_WITHIN $PROBE_TIMEOUT 2 peer_count<br><br>TEST $CLI_1 volume add-brick $V0 replica 3 $H1:$B1/${V0}3 $H2:$B1/${V0}3 $H3:$B1/${V0}3 commit force<br><br>EXPECT_WITHIN $PROCESS_UP_TIMEOUT &quot;1&quot; brick_up_status_1 $V0 $H3 $B1/${V0}3<br>cleanup;<br><br>glusterd log from 1st node:<br>[2018-03-27 13:11:58.630845] E [MSGID: 106053] [glusterd-utils.c:13889:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.replace-brick : Transport endpoint is not connected [Transport endpoint is not connected]<br><br></div>Request some help/attention from AFR folks.<br></div>
</blockquote></div><br></div>
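[Editor's note] A minimal sketch of the client-quorum arithmetic Karthik describes, assuming the default quorum of more-than-half for a replica 3 volume. This is an illustration, not GlusterFS source: the brick names and per-brick results are hypothetical, simulating the state during the temporary replace-brick mount (replaced brick is an empty sink, third node's glusterd is down).

```shell
#!/bin/bash
# Hypothetical illustration (not GlusterFS code) of why the setxattr of
# trusted.replace-brick fails when one source brick is unreachable.

replica_count=3
# Default client quorum for replica 3: more than half of the bricks.
quorum=$(( replica_count / 2 + 1 ))   # = 2

# Simulated per-brick outcome of the setxattr from the temporary mount:
#   brick1 - the brick being replaced (sink), no source-side success
#   brick2 - reachable, setxattr succeeds
#   brick3 - its node's glusterd is down, port lookup fails (ENOTCONN)
results=(fail ok fail)

successes=0
for r in "${results[@]}"; do
    [ "$r" = ok ] && successes=$(( successes + 1 ))
done

if [ "$successes" -ge "$quorum" ]; then
    echo "trusted.replace-brick setxattr: success ($successes/$replica_count)"
else
    echo "trusted.replace-brick setxattr: fails quorum ($successes < $quorum)"
fi
```

With only one acknowledging brick against a required quorum of two, the fop is reported as failed, which is why the replace brick can only "succeed" on the QE setup if it races ahead of glusterd actually going down (leaving brick3 still reachable and the count at two).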