<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On 20 June 2017 at 20:38, Aravinda <span dir="ltr">&lt;<a href="mailto:avishwan@redhat.com" target="_blank">avishwan@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000"><span class="">
    On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:<br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div>
            <div>Xavi, Aravinda and I had a discussion on #gluster-dev
              and we agreed to go with the format Aravinda suggested for
              now. In future we want some more changes in dht to
              detect which subvolume went down and came back up; at that
              time we will revisit the solution suggested by Xavi.<br>
              <br>
            </div>
            Susanth is doing the dht changes<br>
          </div>
          Aravinda is doing geo-rep changes<br>
        </div>
      </div>
    </blockquote></span>
    Done. Geo-rep patch sent for review <a class="m_-881474676573699517moz-txt-link-freetext" href="https://review.gluster.org/17582" target="_blank">https://review.gluster.org/<wbr>17582</a><br>
    <br></div></blockquote><div><br></div><div>The proposed changes to the node-uuid behaviour (while good) are going to break tiering. Tiering changes will take a little more time to code and test. </div><div><br></div><div>As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we go back to the original node-uuid behaviour for now so as to unblock the release and target the proposed changes for the next 3.11 releases.</div><div><br></div><div><br></div><div>Regards,</div><div>Nithya</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    --<br>
    Aravinda<div><div class="h5"><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div><br>
        </div>
        Thanks to all of you guys for the discussions!<br>
        <div>
          <div>
            <div>
              <div>
                <div class="gmail_extra"><br>
                  <div class="gmail_quote">On Tue, Jun 20, 2017 at 5:05
                    PM, Xavier Hernandez <span dir="ltr">&lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
                      Aravinda,<span><br>
                        <br>
                        On 20/06/17 12:42, Aravinda wrote:<br>
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          I think the following format can be easily adopted
                          by all components:<br>
                          <br>
                          UUIDs of a subvolume are separated by spaces
                          and subvolumes are separated
                          by commas.<br>
                          <br>
                          For example, if node1 and node2 form a replica set with
                          UUIDs U1 and U2 respectively, and
                          node3 and node4 form a replica set with UUIDs U3 and U4
                          respectively, then<br>
                          <br>
                          node-uuid can return &quot;U1 U2,U3 U4&quot;<br>
                        </blockquote>
                        <br>
                      </span>
                      While this is ok for the current implementation, I
                      think it can be insufficient if there are more
                      layers of xlators that need to indicate some
                      sort of grouping. A representation that can
                      express hierarchy would be better. For example:
                      &quot;(U1 U2) (U3 U4)&quot; (we can use spaces or commas as
                      separators).<span><br>
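                      A minimal Python sketch of reading the grouped form, assuming a single level of parentheses with UUIDs separated by spaces and/or commas; parse_grouped() is a hypothetical helper, not existing Gluster code:<br>

```python
import re


def parse_grouped(value):
    """Parse "(U1 U2) (U3 U4)" into [["U1", "U2"], ["U3", "U4"]].

    Assumes exactly one level of grouping; inside each pair of
    parentheses, UUIDs may be separated by spaces and/or commas.
    """
    groups = re.findall(r"\(([^()]*)\)", value)
    return [[u for u in re.split(r"[\s,]+", g) if u] for g in groups]


print(parse_grouped("(U1 U2) (U3 U4)"))  # [['U1', 'U2'], ['U3', 'U4']]
print(parse_grouped("(U1,U2),(U3,U4)"))  # [['U1', 'U2'], ['U3', 'U4']]
```

                      Because the grouping is explicit, the separator between groups no longer carries meaning, which is what makes a deeper nesting possible later.<br>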
                        <br>
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          <br>
                          Geo-rep can split by &quot;,&quot;, then split each part by
                          space and take the first UUID<br>
                          DHT can split the value by space or comma and
                          get the list of unique UUIDs<br>
                        </blockquote>
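                      As a rough illustration (hypothetical helper names, plain Python rather than the actual Geo-rep or DHT code), the two consumers described above could read the &quot;U1 U2,U3 U4&quot; value like this:<br>

```python
def parse_node_uuid(value):
    """Split "U1 U2,U3 U4" into one list of UUIDs per subvolume."""
    return [part.split() for part in value.split(",") if part.strip()]


def georep_uuids(value):
    # Geo-rep: keep only the first UUID listed for each subvolume.
    return [uuids[0] for uuids in parse_node_uuid(value)]


def dht_uuids(value):
    # DHT: every distinct UUID, in order of first appearance.
    return list(dict.fromkeys(u for uuids in parse_node_uuid(value) for u in uuids))


value = "U1 U2,U3 U4"
print(georep_uuids(value))  # ['U1', 'U3']
print(dht_uuids(value))     # ['U1', 'U2', 'U3', 'U4']
```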
                        <br>
                      </span>
                      This doesn&#39;t solve the problem I described in the
                      previous email. Some more logic will need to be
                      added to prevent more than one node from each
                      replica-set from being active. If we have some explicit
                      hierarchy information in the node-uuid value, more
                      informed decisions can be taken.<br>
                      <br>
                      An initial proposal I made was this:<br>
                      <br>
                      DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
                      AFR[2,0](NODE(U3), NODE(U4)))<br>
                      <br>
                      This is harder to parse, but gives a lot of
                      information: DHT with 2 subvolumes, each subvolume
                      is an AFR with replica 2 and no arbiters. It&#39;s
                      also easily extensible with any new xlator that
                      changes the layout.<br>
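                      To show that this notation is still mechanical to parse, here is a minimal recursive-descent parser in Python. It assumes the grammar is exactly NAME[params](children) with NODE(uuid) as the leaf; parse_layout() and leaves() are hypothetical names, and this is a sketch rather than a proposed implementation:<br>

```python
import re

# Identifiers/UUIDs, or one of the punctuation characters [ ] ( ) ,
TOKEN = re.compile(r"[A-Za-z0-9_-]+|[\[\](),]")


def parse_layout(text):
    """Parse e.g. "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), ...)" into nested dicts."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expect(tok):
        if take() != tok:
            raise ValueError(f"expected {tok!r} before token {pos}")

    def term():
        name = take()
        params = []
        if peek() == "[":                 # optional [n,m,...] parameters
            take()
            params.append(int(take()))
            while peek() == ",":
                take()
                params.append(int(take()))
            expect("]")
        expect("(")
        if name == "NODE":                # leaf: a single UUID
            children = take()
        else:                             # inner xlator: comma-separated subterms
            children = [term()]
            while peek() == ",":
                take()
                children.append(term())
        expect(")")
        return {"type": name, "params": params, "children": children}

    tree = term()
    if peek() is not None:
        raise ValueError("trailing input")
    return tree


def leaves(tree):
    """All NODE UUIDs below a subtree, left to right."""
    if tree["type"] == "NODE":
        return [tree["children"]]
    return [u for child in tree["children"] for u in leaves(child)]


tree = parse_layout("DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))")
print(tree["type"], tree["params"])           # DHT [2]
print([leaves(c) for c in tree["children"]])  # [['U1', 'U2'], ['U3', 'U4']]
```

                      With the tree available, a consumer such as Geo-rep could pick one UUID per AFR subtree, while rebalance could take leaves(tree) for the whole volume.<br>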
                      <br>
                      However maybe this is not the moment to do this,
                      and probably we could implement this in a new
                      xattr with a better name.<br>
                      <br>
                      Xavi
                      <div class="m_-881474676573699517HOEnZb">
                        <div class="m_-881474676573699517h5"><br>
                          <br>
                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                            <br>
                            Another question is about the behavior when
                            a node is down. The existing
                            node-uuid xattr does not return a node&#39;s UUID if
                            that node is down. What is the
                            behavior with the proposed xattr?<br>
                            <br>
                            Let me know your thoughts.<br>
                            <br>
                            regards<br>
                            Aravinda VK<br>
                            <br>
                            On 06/20/2017 03:06 PM, Aravinda wrote:<br>
                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                              Hi Xavi,<br>
                              <br>
                              On 06/20/2017 02:51 PM, Xavier Hernandez
                              wrote:<br>
                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                Hi Aravinda,<br>
                                <br>
                                On 20/06/17 11:05, Pranith Kumar
                                Karampuri wrote:<br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  Adding more people to get a consensus
                                  about this.<br>
                                  <br>
                                  On Tue, Jun 20, 2017 at 1:49 PM,
                                  Aravinda &lt;<a href="mailto:avishwan@redhat.com" target="_blank">avishwan@redhat.com</a>&gt;
                                  wrote:<br>
                                  <br>
                                  <br>
                                      regards<br>
                                      Aravinda VK<br>
                                  <br>
                                  <br>
                                      On 06/20/2017 01:26 PM, Xavier
                                  Hernandez wrote:<br>
                                  <br>
                                          Hi Pranith,<br>
                                  <br>
                                          adding gluster-devel, Kotresh
                                  and Aravinda,<br>
                                  <br>
                                          On 20/06/17 09:45, Pranith
                                  Kumar Karampuri wrote:<br>
                                  <br>
                                  <br>
                                  <br>
                                              On Tue, Jun 20, 2017 at
                                  1:12 PM, Xavier Hernandez
                                              &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;
                                  wrote:<br>
                                  <br>
                                                  On 20/06/17 09:31,
                                  Pranith Kumar Karampuri wrote:<br>
                                  <br>
                                                      The way
                                  geo-replication works is:<br>
                                                      On each machine,
                                  it does getxattr of node-uuid and
                                  checks whether its own uuid is present in the
                                  list. If it is present, the worker considers itself
                                  active; otherwise it is considered passive.
                                  With this change we are returning all uuids
                                  instead of only the first-up subvolume&#39;s, so all
                                  machines think they are ACTIVE, which is
                                  apparently bad. That is the reason. Even I
                                  felt bad that we are doing this change.<br>
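                                  To make the effect concrete, here is a small Python simulation of the check described above for a hypothetical volume with replica sets (U1, U2) and (U3, U4). It assumes every node is up and that each worker sees the node-uuid value of its own replica set; the function names are illustrative only:<br>

```python
# Hypothetical 2x2 distributed-replicate volume: two replica sets.
REPLICA_SETS = [["U1", "U2"], ["U3", "U4"]]


def is_active(node_uuid_value, my_uuid):
    # Geo-rep rule as described: ACTIVE if my UUID is in the returned list.
    return my_uuid in node_uuid_value.split()


def old_value(replica_set):
    return replica_set[0]            # only the first-up brick's UUID (all up here)


def new_value(replica_set):
    return " ".join(replica_set)     # every UUID of the replica set


def active_workers(value_of):
    return [u for rs in REPLICA_SETS for u in rs if is_active(value_of(rs), u)]


print(active_workers(old_value))  # ['U1', 'U3']: one worker per replica set
print(active_workers(new_value))  # ['U1', 'U2', 'U3', 'U4']: every worker
```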
                                  <br>
                                  <br>
                                                  And what about
                                  changing the content of node-uuid to include some
                                                  sort of hierarchy?<br>
                                  <br>
                                                  for example:<br>
                                  <br>
                                                  a single brick:<br>
                                  <br>
                                                  NODE(&lt;guid&gt;)<br>
                                  <br>
                                                  AFR/EC:<br>
                                  <br>
                                                 
                                  AFR[2](NODE(&lt;guid&gt;),
                                  NODE(&lt;guid&gt;))<br>
                                                 
                                  EC[3,1](NODE(&lt;guid&gt;),
                                  NODE(&lt;guid&gt;),
                                  NODE(&lt;guid&gt;))<br>
                                  <br>
                                                  DHT:<br>
                                  <br>
                                                 
                                  DHT[2](AFR[2](NODE(&lt;guid&gt;), NODE(&lt;guid&gt;)),
                                  AFR[2](NODE(&lt;guid&gt;), NODE(&lt;guid&gt;)))<br>
                                  <br>
                                                  This gives a lot of
                                  information that can be used to take the
                                                  appropriate decisions.<br>
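                                  Assuming each xlator simply wraps the values returned by its children, the strings above could be built bottom-up. A Python sketch, where node() and xlator() are hypothetical helpers and u1..u4 stand in for real GUIDs:<br>

```python
def node(uuid):
    return f"NODE({uuid})"


def xlator(kind, params, children):
    # e.g. xlator("AFR", [2], [...]) gives "AFR[2](child, child)"
    return f"{kind}[{','.join(map(str, params))}]({', '.join(children)})"


afr_a = xlator("AFR", [2], [node("u1"), node("u2")])
afr_b = xlator("AFR", [2], [node("u3"), node("u4")])
print(xlator("DHT", [2], [afr_a, afr_b]))
# DHT[2](AFR[2](NODE(u1), NODE(u2)), AFR[2](NODE(u3), NODE(u4)))
print(xlator("EC", [3, 1], [node("u1"), node("u2"), node("u3")]))
# EC[3,1](NODE(u1), NODE(u2), NODE(u3))
```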
                                  <br>
                                  <br>
                                              I guess that is not
                                  backward compatible. Shall I CC
                                              gluster-devel and
                                              Kotresh/Aravinda?<br>
                                  <br>
                                  <br>
                                          Is the change we did backward
                                  compatible? If we only require
                                          the first field to be a GUID
                                  to support backward compatibility,
                                          we can use something like
                                  this:<br>
                                  <br>
                                      No. But the necessary change can
                                  be made to the Geo-rep code as well if the
                                      format is changed, since all these
                                  are built and shipped together.<br>
                                  <br>
                                      Geo-rep uses node-uuid as follows:<br>
                                  <br>
                                      value = getxattr(node-uuid)<br>
                                      active_node_uuids = value.split(SPACE)<br>
                                      active_node_flag = self.node_id in active_node_uuids<br>
                                </blockquote>
                                <br>
                                How was this case solved?<br>
                                <br>
                                Suppose we have three servers and 2
                                bricks in each server. A
                                replicated volume is created using the
                                following command:<br>
                                <br>
                                gluster volume create test replica 2
                                server1:/brick1 server2:/brick1
                                server2:/brick2 server3:/brick1
                                server3:/brick2 server1:/brick2<br>
                                <br>
                                In this case we have three replica-sets:<br>
                                <br>
                                * server1:/brick1 server2:/brick1<br>
                                * server2:/brick2 server3:/brick1<br>
                                * server3:/brick2 server1:/brick2<br>
                                <br>
                                The old AFR implementation of node-uuid
                                always returned the uuid of the
                                node of the first brick, so in this case
                                we will get the uuids of all
                                three nodes because each of them hosts the
                                first brick of a replica-set.<br>
                                <br>
                                Does this mean that with this
                                configuration all nodes are active? Is
                                this a problem? Is there any other
                                check to avoid this situation if
                                it&#39;s not good?<br>
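                                With replica 2, each consecutive pair of bricks on the command line forms a replica-set. A short Python sketch, assuming the six bricks are listed in the order of the three replica-sets shown, reproduces which nodes the old first-brick rule would report (replica_sets() is a hypothetical helper):<br>

```python
def replica_sets(bricks, replica):
    # Consecutive groups of `replica` bricks form one replica-set.
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


bricks = ["server1:/brick1", "server2:/brick1",
          "server2:/brick2", "server3:/brick1",
          "server3:/brick2", "server1:/brick2"]

sets = replica_sets(bricks, 2)
# Old AFR rule: node-uuid reports the node hosting the first brick of each set.
first_nodes = [s[0].split(":")[0] for s in sets]
print(first_nodes)  # ['server1', 'server2', 'server3']
```

                                Every server shows up exactly once, so every node&#39;s UUID appears in some replica-set&#39;s value.<br>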
                              </blockquote>
                              Yes, all Geo-rep workers will become Active
                              and participate in syncing.
                              Since changelogs have the same
                              information in replica bricks, this
                              leads to duplicate syncing and
                              wasted network bandwidth.<br>
                              <br>
                              Node-uuid based Active worker selection has been the
                              default configuration in Geo-rep
                              so far. Geo-rep also has Meta Volume
                              based synchronization of the Active
                              worker using lock files (it can be enabled
                              through Geo-rep configuration;
                              with this config node-uuid is not
                              used).<br>
                              <br>
                              Kotresh proposed a solution to configure
                              which worker becomes
                              Active. This gives the admin more control over
                              choosing Active workers, and it will become the default configuration
                              from 3.12:<br>
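                              For comparison, the idea behind lock-based election can be sketched locally in Python: whichever worker takes an exclusive, non-blocking lock on a per-replica-set lock file becomes Active. The temporary directory and the lock file name are hypothetical stand-ins for the meta volume, and this does not reproduce Geo-rep&#39;s actual implementation:<br>

```python
import fcntl
import os
import tempfile


def try_become_active(lock_path):
    """Return an open fd if we got the lock (Active), or None (Passive)."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive and non-blocking: fail at once if another holder exists.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd  # keep it open; closing the fd releases the lock


lock_dir = tempfile.mkdtemp()  # stand-in for a shared meta volume mount
lock_path = os.path.join(lock_dir, "replica-set-1.lock")  # hypothetical name

worker_a = try_become_active(lock_path)
worker_b = try_become_active(lock_path)
print(worker_a is not None, worker_b is None)  # True True
os.close(worker_a)  # worker A stops; the lock is free again
worker_c = try_become_active(lock_path)
print(worker_c is not None)  # True
os.close(worker_c)
```

                              Closing the descriptor (or the worker process exiting) releases the lock, which is what lets a Passive worker take over without relying on node-uuid.<br>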
                              <a href="https://github.com/gluster/glusterfs/issues/244" rel="noreferrer" target="_blank">https://github.com/gluster/glu<wbr>sterfs/issues/244</a><br>
                              <br>
                              --<br>
                              Aravinda<br>
                              <br>
                              <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                <br>
                                Xavi<br>
                                <br>
                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                  <br>
                                  <br>
                                  <br>
                                          Bricks:<br>
                                  <br>
                                          &lt;guid&gt;<br>
                                  <br>
                                          AFR/EC:<br>
                                          &lt;guid&gt;(&lt;guid&gt;,
                                  &lt;guid&gt;)<br>
                                  <br>
                                          DHT:<br>
                                         
                                  &lt;guid&gt;(&lt;guid&gt;(&lt;guid&gt;,
                                  ...), &lt;guid&gt;(&lt;guid&gt;, ...))<br>
                                  <br>
                                          In this case, AFR and EC would
                                  return the same &lt;guid&gt; they
                                          returned before the patch, but
                                  between &#39;(&#39; and &#39;)&#39; they would put the
                                          full list of GUIDs of all
                                  nodes. The first &lt;guid&gt; can be
                                  used
                                          by geo-replication. The list
                                  after the first &lt;guid&gt; can be
                                  used
                                          for rebalance.<br>
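                                  A Python sketch of how a consumer might read this backward-compatible format, assuming GUIDs never contain spaces, commas or parentheses. split_compat() is a hypothetical helper and the DHT value in the example is illustrative: an old-style consumer only needs the first token, while rebalance collects the leaf GUIDs inside the parentheses.<br>

```python
import re

# A GUID-like token, or one of the characters ( ) ,
TOKEN = re.compile(r"[^\s(),]+|[(),]")


def split_compat(value):
    """Split "G(G1, G2)" style values into (first_guid, node_guids).

    first_guid keeps the old single-UUID meaning (for geo-replication);
    node_guids are the leaf GUIDs inside the parentheses (for rebalance).
    """
    tokens = TOKEN.findall(value)
    if not tokens:
        raise ValueError("empty node-uuid value")
    leaves = [
        tok for i, tok in enumerate(tokens)
        if i > 0
        and tok not in {"(", ")", ","}
        and (i + 1 == len(tokens) or tokens[i + 1] != "(")  # skip inner group heads
    ]
    return tokens[0], list(dict.fromkeys(leaves))


print(split_compat("U1"))                          # ('U1', [])
print(split_compat("U1(U1, U2)"))                  # ('U1', ['U1', 'U2'])
print(split_compat("U1(U1(U1, U2), U3(U3, U4))"))  # ('U1', ['U1', 'U2', 'U3', 'U4'])
```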
                                  <br>
                                          Not sure if there&#39;s any user
                                  of node-uuid above DHT.<br>
                                  <br>
                                          Xavi<br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                                  Xavi<br>
                                  <br>
                                  <br>
                                                      On Tue, Jun 20,
                                  2017 at 12:46 PM, Xavier Hernandez
                                                      &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;
                                                      wrote:<br>
                                  <br>
                                                          Hi Pranith,<br>
                                  <br>
                                                          On 20/06/17
                                  07:53, Pranith Kumar Karampuri
                                  wrote:<br>
                                  <br>
                                                              Hi Xavi,<br>
                                                                     We
                                  all made the mistake of not
                                  sending a mail about changing the behavior of the
                                  node-uuid xattr so that rebalance can use
                                  multiple nodes for doing rebalance.
                                  Because of this, on geo-rep all
                                  the workers are becoming active
                                  instead of one per EC/AFR subvolume.
                                  So we are frantically trying
                                  to restore the functionality of node-uuid
                                  and introduce a new xattr for
                                  the new behavior. Sunil will be sending out
                                  a patch for this.<br>
                                  <br>
                                  <br>
                                                          Wouldn&#39;t it be
                                  better to change geo-rep behavior
                                  to use the new data? I think it&#39;s
                                  better as it is now, since it gives more
                                  information to upper layers so that they can take more
                                  accurate decisions.<br>
                                  <br>
                                                          Xavi<br>
                                  <br>
                                  <br>
                                                              --<br>
                                                              Pranith<br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                                      --<br>
                                                      Pranith<br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                              --<br>
                                              Pranith<br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                  <br>
                                  --<br>
                                  Pranith<br>
                                </blockquote>
                                <br>
                              </blockquote>
                              <br>
                            </blockquote>
                            <br>
                          </blockquote>
                          <br>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                  <br>
                  <br clear="all">
                  <br>
                  -- <br>
                  <div class="m_-881474676573699517gmail_signature" data-smartmail="gmail_signature">
                    <div dir="ltr">Pranith<br>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
  </div></div></div>

</blockquote></div><br></div></div>