<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez <span dir="ltr">&lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Pranith,<br>
<br>
On 05/07/17 12:28, Pranith Kumar Karampuri wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
<br>
On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt; wrote:<br>
<br>
    Hi Pranith,<br>
<br>
    On 03/07/17 08:33, Pranith Kumar Karampuri wrote:<br>
<br>
        Xavi,<br>
              Now that the change has been reverted, we can resume this<br>
        discussion and decide on the exact format that considers tier, dht,<br>
        afr and ec. People working on geo-rep/dht/afr/ec had an internal<br>
        discussion and we all agreed that this proposal would be a good way<br>
        forward. I think once we agree on the format and decide on the initial<br>
        encoding/decoding functions of the xattr and this change is merged, we<br>
        can send patches on afr/ec/dht and geo-rep to take it to closure.<br>
<br>
        Could you propose the new format you have in mind that considers<br>
        all of the xlators?<br>
<br>
<br>
    My idea was to create a new xattr not bound to any particular<br>
    function but which could give enough information to be used in many<br>
    places.<br>
<br>
    Currently we have another attribute called glusterfs.pathinfo that<br>
    returns hierarchical information about the location of a file. Maybe<br>
    we can extend this to unify all these attributes into a single<br>
    feature that could be used for multiple purposes.<br>
<br>
    Since we have time to discuss it, I would like to design it with<br>
    more information than we have already talked about.<br>
<br>
    First of all, the amount of information that this attribute can<br>
    contain is quite big if we expect to have volumes with thousands of<br>
    bricks. Even in the simplest case of returning only a UUID per brick,<br>
    we can easily go beyond the limit of 64KB.<br>
<br>
    Consider also, for example, what shard should return when pathinfo<br>
    is requested for a file. Probably it should return a list of shards,<br>
    each one with all its associated pathinfo. We are talking about big<br>
    amounts of data here.<br>
<br>
    I think this kind of information doesn&#39;t fit very well in an<br>
    extended attribute. Another thing to consider is that most probably<br>
    the requester of the data only needs a fragment of it, so we are<br>
    generating big amounts of data only to be parsed and reduced later,<br>
    dismissing most of it.<br>
<br>
    What do you think about using a very special virtual file to manage<br>
    all this information? It could be easily read using normal read<br>
    fops, so it could manage big amounts of data easily. Also, by accessing<br>
    only some parts of the file we could go directly where we want,<br>
    avoiding the read of all remaining data.<br>
<br>
    A very basic idea could be this:<br>
<br>
    Each xlator would have a reserved area of the file. We can reserve<br>
    up to 4GB per xlator (32 bits). The remaining 32 bits of the offset<br>
    would indicate the xlator we want to access.<br>
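<br>
    As a rough illustration of the offset split described above, the sketch below<br>
    packs an xlator index and an offset inside that xlator&#39;s area into a single<br>
    64-bit file offset. The names make_offset and split_offset are hypothetical,<br>
    and treating index 0 as the generic volume area is an assumption; signedness<br>
    of the real offset type is ignored here.<br>

```python
XLATOR_AREA_BITS = 32                      # 4 GB reserved per xlator
XLATOR_AREA_SIZE = 1 << XLATOR_AREA_BITS   # number of bytes in one area
LOCAL_MASK = XLATOR_AREA_SIZE - 1          # low 32 bits of the offset


def make_offset(xlator_index, local_offset):
    """Combine an xlator index and an offset inside its area."""
    if xlator_index not in range(XLATOR_AREA_SIZE):
        raise ValueError("xlator index does not fit in 32 bits")
    if local_offset not in range(XLATOR_AREA_SIZE):
        raise ValueError("local offset does not fit in 32 bits")
    return (xlator_index << XLATOR_AREA_BITS) | local_offset


def split_offset(file_offset):
    """Return (xlator_index, local_offset) for a file offset."""
    return file_offset >> XLATOR_AREA_BITS, file_offset & LOCAL_MASK


# Assumed example: the area of xlator 3, one megabyte in.
offset = make_offset(3, 0x100000)
index, local = split_offset(offset)
```

    A reader would first read the generic area at offset 0 to learn which index<br>
    belongs to which xlator, then seek to the computed offset.<br>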
<br>
    At offset 0 we have generic information about the volume. One of the<br>
    things that this information should include is a basic hierarchy<br>
    of the whole volume and the offset for each xlator.<br>
<br>
    After reading this, the user will seek to the desired offset and<br>
    read the information related to the xlator it is interested in.<br>
<br>
    All the information should be stored in a format easily extensible<br>
    that will be kept compatible even if new information is added in the<br>
    future (for example doing special mappings of the 32 bits offsets<br>
    reserved for the xlator).<br>
<br>
    For example we can reserve the first megabyte of the xlator area to<br>
    have a mapping of attributes to their respective offsets.<br>
<br>
    I think that using a binary format would simplify all this a lot.<br>
<br>
    Do you think this is a way to explore or should I stop wasting time<br>
    here ?<br>
<br>
<br>
I think this just became a very big feature :-). Shall we just live with<br>
it the way it is now?<br>
</blockquote>
<br>
I supposed so...<br>
<br>
The only thing we need to check is whether shard needs to handle this xattr. If so, what should it return? Only the UUIDs corresponding to the first shard, or the UUIDs of all bricks containing at least one shard? I guess that the first one is enough, but just to be sure...<br>
<br>
My proposal was to implement a new xattr, for example glusterfs.layout, that contains enough information to be usable in all current use cases.<br></blockquote><div><br></div><div>Actually pathinfo is supposed to give this information and it already has the following format: for a 5x2 distributed-replicate volume<br><br>root@dhcp35-190 - /mnt/v3 <br>13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d<br># file: d<br>trusted.glusterfs.pathinfo=&quot;((&lt;DISTRIBUTE:v3-dht&gt; (&lt;REPLICATE:v3-replicate-0&gt; &lt;POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d&gt; &lt;POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d&gt;) (&lt;REPLICATE:v3-replicate-2&gt; &lt;POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d&gt; &lt;POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d&gt;) (&lt;REPLICATE:v3-replicate-1&gt; &lt;POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d&gt; &lt;POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d&gt;) (&lt;REPLICATE:v3-replicate-4&gt; &lt;POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d&gt; &lt;POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d&gt;) (&lt;REPLICATE:v3-replicate-3&gt; &lt;POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d&gt; &lt;POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d&gt;)) (v3-dht-layout (v3-replicate-0 0 858993458) (v3-replicate-1 858993459 1717986917) (v3-replicate-2 1717986918 2576980376) (v3-replicate-3 2576980377 3435973835) (v3-replicate-4 3435973836 4294967295)))&quot;<br><br><br>root@dhcp35-190 - /mnt/v3 <br>13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a<br># file: d/a<br>trusted.glusterfs.pathinfo=&quot;(&lt;DISTRIBUTE:v3-dht&gt; (&lt;REPLICATE:v3-replicate-1&gt; &lt;POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a&gt; 
&lt;POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a&gt;))&quot;<br><br> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
The idea would be that each xlator that makes a significant change in the way or the place where files are stored should put information in this xattr. The information should include:<br>
<br>
* Type (basically AFR, EC, DHT, ...)<br>
* Basic configuration (replication and arbiter for AFR, data and redundancy for EC, # subvolumes for DHT, shard size for sharding, ...)<br>
* Quorum imposed by the xlator<br>
* UUID data coming from subvolumes (sorted by brick position)<br>
* It should be easily extensible in the future<br>
<br>
The last point is very important to avoid the issues we have seen now. We must be able to incorporate more information without breaking backward compatibility. To do so, we can add tags for each value.<br>
<br>
For example, a distribute 2, replica 2 volume with 1 arbiter should be represented by this string:<br>
<br>
   DHT[dist=2,quorum=1](<br>
      AFR[rep=2,arbiter=1,quorum=2](<br>
         NODE[quorum=2,uuid=&lt;UUID1&gt;](&lt;<wbr>path1&gt;),<br>
         NODE[quorum=2,uuid=&lt;UUID2&gt;](&lt;<wbr>path2&gt;),<br>
         NODE[quorum=2,uuid=&lt;UUID3&gt;](&lt;<wbr>path3&gt;)<br>
      ),<br>
      AFR[rep=2,arbiter=1,quorum=2](<br>
         NODE[quorum=2,uuid=&lt;UUID4&gt;](&lt;<wbr>path4&gt;),<br>
         NODE[quorum=2,uuid=&lt;UUID5&gt;](&lt;<wbr>path5&gt;),<br>
         NODE[quorum=2,uuid=&lt;UUID6&gt;](&lt;<wbr>path6&gt;)<br>
      )<br>
   )<br>
<br>
Some explanations:<br>
<br>
AFAIK DHT doesn&#39;t have quorum, so the default is &#39;1&#39;. We may decide to omit it when it&#39;s &#39;1&#39; for any xlator.<br>
<br>
Quorum in AFR represents client-side enforced quorum. Quorum in NODE represents the server-side enforced quorum.<br>
<br>
The &lt;path&gt; shown in each NODE represents the physical location of the file (similar to current glusterfs.pathinfo) because this xattr can be retrieved for a particular file using getxattr. This is nice, but we can remove it for now if it&#39;s difficult to implement.<br>
<br>
We can decide to have a verbose string or try to omit some fields when not strictly necessary. For example, if there are no arbiters, we can omit the &#39;arbiter&#39; tag instead of writing &#39;arbiter=0&#39;. We could also implicitly compute &#39;dist&#39; and &#39;rep&#39; from the number of elements contained between &#39;()&#39;.<br>
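<br>
To give an idea of how hard such a string would be to consume, here is a minimal recursive parser sketch for the TYPE[tag=value,...](children) form shown above. It is only an illustration: parse_layout is a hypothetical name, the UUIDs and brick paths are placeholders, and it assumes that paths contain no &#39;)&#39; and that tag values contain no &#39;,&#39; or &#39;]&#39;.<br>

```python
import re

# Matches an xlator type followed by an optional [tag=value,...] list.
HEADER = re.compile(r"\s*([A-Za-z_]+)\s*(?:\[([^\]]*)\])?\s*")


def parse_layout(text, pos=0):
    """Parse one TYPE[tags](children) element; return (node, next_position)."""
    match = HEADER.match(text, pos)
    if match is None:
        raise ValueError(f"expected an xlator type at offset {pos}")
    tags = {}
    for item in (match.group(2) or "").split(","):
        if item.strip():
            key, _, value = item.partition("=")
            tags[key.strip()] = value.strip()
    node = {"type": match.group(1), "tags": tags, "children": [], "path": None}
    pos = match.end()
    if text[pos:pos + 1] != "(":
        return node, pos
    pos += 1
    if node["type"] == "NODE":
        # Leaf: everything up to the closing parenthesis is the brick path.
        end = text.index(")", pos)
        node["path"] = text[pos:end].strip()
        return node, end + 1
    while True:
        child, pos = parse_layout(text, pos)
        node["children"].append(child)
        rest = text[pos:].lstrip()
        pos = len(text) - len(rest)
        if rest.startswith(","):
            pos += 1
        elif rest.startswith(")"):
            return node, pos + 1
        else:
            raise ValueError(f"expected ',' or ')' at offset {pos}")


# Placeholder UUIDs (U1..U6) and brick paths, following the example above.
example = (
    "DHT[dist=2,quorum=1]("
    "AFR[rep=2,arbiter=1,quorum=2]("
    "NODE[quorum=2,uuid=U1](/bricks/b1), "
    "NODE[quorum=2,uuid=U2](/bricks/b2), "
    "NODE[quorum=2,uuid=U3](/bricks/b3)), "
    "AFR[rep=2,arbiter=1,quorum=2]("
    "NODE[quorum=2,uuid=U4](/bricks/b4), "
    "NODE[quorum=2,uuid=U5](/bricks/b5), "
    "NODE[quorum=2,uuid=U6](/bricks/b6)))"
)
root, _ = parse_layout(example)
first_afr = root["children"][0]
# 'dist' and 'rep' derived from the element counts instead of explicit tags.
dist = len(root["children"])
rep = len(first_afr["children"]) - int(first_afr["tags"].get("arbiter", 0))
```

Unknown tags simply end up in the tags dictionary, so a consumer written this way would keep working when new tags are added later.<br>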
<br>
What do you think?<br></blockquote><div><br></div><div>Quite a few people are already familiar with pathinfo. So I am of the opinion that we should give this information through that xattr itself. This xattr hasn&#39;t changed after quorum/arbiter/shard came in, so maybe it should?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Xavi<br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
<br>
<br>
    Xavi<br>
<br>
<br>
<br>
<br>
        On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya<br>
        &lt;<a href="mailto:ksubrahm@redhat.com" target="_blank">ksubrahm@redhat.com</a>&gt; wrote:<br>
<br>
<br>
<br>
            On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez<br>
            &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt; wrote:<br>
<br>
                That&#39;s ok. I&#39;m currently unable to write a patch for this on ec.<br>
<br>
            Sunil is working on this patch.<br>
<br>
            ~Karthik<br>
<br>
                If no one can do it, I can try to do it in 6 - 7 hours...<br>
<br>
                Xavi<br>
<br>
<br>
                On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri<br>
                &lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt; wrote:<br>
<br>
<br>
<br>
                    On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez<br>
                    &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt; wrote:<br>
<br>
                        I&#39;m ok with reverting node-uuid content to the previous<br>
                        format and creating a new xattr for the new format.<br>
                        Currently, only rebalance will use it.<br>
<br>
                        The only thing to consider is what can happen if we have a<br>
                        half-upgraded cluster where some clients have this change<br>
                        and some do not. Can rebalance work in this situation? If<br>
                        so, could there be any issue?<br>
<br>
<br>
                    I think there shouldn&#39;t be any problem, because this is an<br>
                    in-memory xattr, so layers below afr/ec will only see the<br>
                    node-uuid xattr.<br>
                    This also gives us a chance to do whatever we want to do in<br>
                    future with this xattr without any problems about backward<br>
                    compatibility.<br>
<br>
                    You can check<br>
                    <a href="https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507" rel="noreferrer" target="_blank">https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507</a><br>
                    for how Karthik implemented this in AFR (this got merged<br>
                    accidentally yesterday, but it looks like this is what we are<br>
                    settling on)<br>
<br>
<br>
<br>
                        Xavi<br>
<br>
<br>
                        On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar<br>
                        Karampuri &lt;<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>&gt; wrote:<br>
<br>
<br>
<br>
                            On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran<br>
                            &lt;<a href="mailto:nbalacha@redhat.com" target="_blank">nbalacha@redhat.com</a>&gt; wrote:<br>
<br>
<br>
                                On 20 June 2017 at 20:38, Aravinda<br>
                                &lt;<a href="mailto:avishwan@redhat.com" target="_blank">avishwan@redhat.com</a>&gt; wrote:<br>
<br>
                                    On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:<br>
<br>
                                        Xavi, Aravinda and I had a discussion on #gluster-dev and<br>
                                        we agreed to go with the format Aravinda suggested for now.<br>
                                        In future we want some more changes for dht to detect which<br>
                                        subvolume went down or came back up; at that time we will<br>
                                        revisit the solution suggested by Xavi.<br>
<br>
                                        Susanth is doing the dht changes<br>
                                        Aravinda is doing geo-rep changes<br>
<br>
                                    Done. Geo-rep patch sent for review:<br>
                                    <a href="https://review.gluster.org/17582" rel="noreferrer" target="_blank">https://review.gluster.org/17582</a><br>
<br>
<br>
<br>
                                The proposed changes to the node-uuid behaviour<br>
                                (while good) are going to break tiering. Tiering<br>
                                changes will take a little more time to be coded and<br>
                                tested.<br>
<br>
                                As this is a regression for 3.11 and a blocker for<br>
                                3.11.1, I suggest we go back to the original<br>
                                node-uuid behaviour for now so as to unblock the<br>
                                release and target the proposed changes for the next<br>
                                3.11 releases.<br>
<br>
<br>
                            Let me see if I understand the changes correctly. We are<br>
                            restoring the behavior of node-uuid xattr and adding a<br>
                            new xattr for parallel rebalance for both afr and ec,<br>
                            correct? Otherwise that is one more regression. If yes,<br>
                            we will also wait for Xavi&#39;s inputs. Jeff accidentally<br>
                            merged the afr patch yesterday which does these changes.<br>
                            If everyone is in agreement, we will leave it as is and<br>
                            add similar changes in ec as well. If we are not in<br>
                            agreement, then we will let the discussion progress :-)<br>
<br>
<br>
<br>
<br>
                                Regards,<br>
                                Nithya<br>
<br>
                                    --<br>
                                    Aravinda<br>
<br>
<br>
                                        Thanks to all of you guys for the discussions!<br>
<br>
                                        On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez<br>
                                        &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt; wrote:<br>
<br>
                                            Hi Aravinda,<br>
<br>
                                            On 20/06/17 12:42, Aravinda wrote:<br>
<br>
                                                I think the following format can be easily<br>
                                                adopted by all components:<br>
<br>
                                                UUIDs of a subvolume are separated by<br>
                                                space and subvolumes are separated<br>
                                                by comma.<br>
<br>
                                                For example, node1 and node2 are replica<br>
                                                with U1 and U2 UUIDs respectively, and<br>
                                                node3 and node4 are replica with U3 and<br>
                                                U4 UUIDs respectively.<br>
<br>
                                                node-uuid can return &quot;U1 U2,U3 U4&quot;<br>
<br>
<br>
                                            While this is ok for the current implementation,<br>
                                            I think this can be insufficient if there<br>
                                            are more layers of xlators that need to<br>
                                            indicate some sort of grouping. Some<br>
                                            representation that can express hierarchy<br>
                                            would be better. For example: &quot;(U1 U2) (U3<br>
                                            U4)&quot; (we can use spaces or commas as a<br>
                                            separator).<br>
<br>
<br>
<br>
                                                Geo-rep can split by &quot;,&quot; and then split<br>
                                                by space and take the first UUID.<br>
                                                DHT can split the value by space or<br>
                                                comma and get the unique UUIDs list.<br>
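<br>
                                                A small sketch of the two splits described above, assuming<br>
                                                the value has exactly the &quot;U1 U2,U3 U4&quot; shape. The function<br>
                                                names are hypothetical and do not come from geo-rep or DHT.<br>

```python
def georep_first_uuids(value):
    """Split by ',' into subvolumes and keep the first UUID of each."""
    return [group.split()[0] for group in value.split(",") if group.strip()]


def dht_unique_uuids(value):
    """Split by space or ',' and keep each UUID once, in first-seen order."""
    unique = []
    for uuid in value.replace(",", " ").split():
        if uuid not in unique:
            unique.append(uuid)
    return unique


value = "U1 U2,U3 U4"
first_per_subvolume = georep_first_uuids(value)
all_uuids = dht_unique_uuids(value)
```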
<br>
<br>
                                            This doesn&#39;t solve the problem I described<br>
                                            in the previous email. Some more logic will<br>
                                            need to be added to prevent more than one node<br>
                                            from each replica set from being active. If we<br>
                                            have some explicit hierarchy information in<br>
                                            the node-uuid value, more decisions can be<br>
                                            taken.<br>
<br>
                                            An initial proposal I made was this:<br>
<br>
                                            DHT[2](AFR[2,0](NODE(U1), NODE(U2)),<br>
                                            AFR[2,0](NODE(U1), NODE(U2)))<br>
<br>
                                            This is harder to parse, but gives a lot of<br>
                                            information: DHT with 2 subvolumes, each<br>
                                            subvolume is an AFR with replica 2 and no<br>
                                            arbiters. It&#39;s also easily extensible with<br>
                                            any new xlator that changes the layout.<br>
<br>
                                            However maybe this is not the moment to do<br>
                                            this, and probably we could implement this<br>
                                            in a new xattr with a better name.<br>
<br>
                                            Xavi<br>
<br>
<br>
<br>
                                                Another question is about the behavior<br>
                                                when a node is down: the existing<br>
                                                node-uuid xattr will not return that<br>
                                                UUID if a node is down. What is the<br>
                                                behavior with the proposed xattr?<br>
<br>
                                                Let me know your thoughts.<br>
<br>
                                                regards<br>
                                                Aravinda VK<br>
<br>
                                                On 06/20/2017 03:06 PM, Aravinda wrote:<br>
<br>
                                                    Hi Xavi,<br>
<br>
                                                    On 06/20/2017 02:51 PM, Xavier Hernandez wrote:<br>
<br>
                                                        Hi Aravinda,<br>
<br>
                                                        On 20/06/17 11:05, Pranith Kumar Karampuri wrote:<br>
<br>
                                                            Adding more people to get a<br>
                                                            consensus about this.<br>
<br>
                                                            On Tue, Jun 20, 2017 at 1:49 PM, Aravinda<br>
                                                            &lt;<a href="mailto:avishwan@redhat.com" target="_blank">avishwan@redhat.com</a>&gt; wrote:<br>
<br>
<br>
                                                                regards<br>
                                                                Aravinda VK<br>
<br>
<br>
                                                                On 06/20/2017 01:26 PM, Xavier Hernandez wrote:<br>
<br>
                                                                    Hi Pranith,<br>
<br>
                                                                    adding gluster-devel, Kotresh and Aravinda,<br>
<br>
                                                                    On 20/06/17 09:45, Pranith Kumar Karampuri wrote:<br>
<br>
<br>
<br>
<br>
                    On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez<br>
                    &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt; wrote:<br>
<br>
<br>
                        On 20/06/17 09:31, Pranith Kumar Karampuri wrote:<br>
<br>
<br>
                            The way geo-replication works is:<br>
<br>
                            On each machine, it does a getxattr of node-uuid and checks whether its own uuid is present in the list. If it is present, the worker considers itself active; otherwise it is considered passive. With this change we are returning all uuids instead of only the uuid of the first-up subvolume, so all machines think they are ACTIVE, which is apparently bad. So that is the reason. Even I felt bad that we are doing this change.<br>
<br>
<br>
<br>
                        And what about changing the content of node-uuid to include some sort of hierarchy?<br>
<br>
                        For example:<br>
<br>
                        a single brick:<br>
<br>
                        NODE(&lt;guid&gt;)<br>
<br>
                        AFR/EC:<br>
<br>
                        AFR[2](NODE(&lt;guid&gt;), NODE(&lt;guid&gt;))<br>
                        EC[3,1](NODE(&lt;guid&gt;), NODE(&lt;guid&gt;), NODE(&lt;guid&gt;))<br>
<br>
                        DHT:<br>
<br>
                        DHT[2](AFR[2](NODE(&lt;guid&gt;), NODE(&lt;guid&gt;)), AFR[2](NODE(&lt;guid&gt;), NODE(&lt;guid&gt;)))<br>
<br>
                        This gives a lot of information that can be used to take the appropriate decisions.<br>
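<br>
                        A minimal sketch of how a consumer might parse the hierarchical value proposed above. The grammar is only inferred from the examples (an upper-case name, optional bracketed numbers, then a parenthesised list), and the names Entry, parse and node_guids are hypothetical, not part of any existing GlusterFS code:<br>

```python
import re
from dataclasses import dataclass, field

# Head of an entry: NAME, optional [n,m] parameters, then "(".
# This grammar is an assumption inferred from the examples in the mail.
_HEAD = re.compile(r"\s*([A-Z]+)(?:\[(\d+(?:,\d+)*)\])?\(")


@dataclass
class Entry:
    kind: str                                      # "NODE", "AFR", "EC", "DHT"
    params: list = field(default_factory=list)     # e.g. [3, 1] for EC[3,1]
    guid: str = ""                                 # only set for NODE entries
    children: list = field(default_factory=list)   # nested entries


def parse(text, pos=0):
    """Parse one entry starting at text[pos]; return (Entry, next offset)."""
    m = _HEAD.match(text, pos)
    if m is None:
        raise ValueError("expected an entry at offset %d" % pos)
    kind = m.group(1)
    params = [int(p) for p in m.group(2).split(",")] if m.group(2) else []
    pos = m.end()
    if kind == "NODE":
        end = text.index(")", pos)  # raises ValueError if ")" is missing
        return Entry(kind, params, guid=text[pos:end].strip()), end + 1
    children = []
    while True:
        child, pos = parse(text, pos)
        children.append(child)
        rest = text[pos:].lstrip()
        pos = len(text) - len(rest)  # offset of the next non-blank character
        if rest.startswith(","):
            pos += 1
        elif rest.startswith(")"):
            return Entry(kind, params, children=children), pos + 1
        else:
            raise ValueError("expected ',' or ')' at offset %d" % pos)


def node_guids(entry):
    """Return the guids of all NODE leaves, left to right."""
    if entry.kind == "NODE":
        return [entry.guid]
    return [g for child in entry.children for g in node_guids(child)]
```

                        With such a tree, geo-replication could pick one node per AFR/EC entry, while rebalance could use every leaf returned by node_guids.<br>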
<br>
<br>
<br>
                    I guess that is not backward compatible. Shall I CC gluster-devel and Kotresh/Aravinda?<br>
<br>
<br>
                                                                    Is the change we did backward compatible? If we only require the first field to be a GUID to support backward compatibility, we can use something like this:<br>
<br>
                                                                No. But the necessary change can be made to the Geo-rep code as well if the format is changed, since all these are built/shipped together.<br>
<br>
                                                                Geo-rep uses node-id as follows:<br>
<br>
                                                                list = listxattr(node-uuid)<br>
                                                                active_node_uuids = list.split(SPACE)<br>
                                                                active_node_flag = True if self.node_id exists in active_node_uuids else False<br>
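<br>
                                                                The check above can be sketched in Python as follows. This is an illustration only: the xattr value is passed in as a plain string instead of being read from a mount, and the uuids are made up. It shows why every worker turns Active once the value lists all nodes:<br>

```python
def is_active(node_uuid_value, my_uuid):
    """Split the xattr value on whitespace and test membership,
    as in the pseudocode above."""
    active_node_uuids = node_uuid_value.split()
    return my_uuid in active_node_uuids


# Made-up uuids for the two nodes holding the bricks of one replica pair.
NODE_A = "11111111-1111-1111-1111-111111111111"
NODE_B = "22222222-2222-2222-2222-222222222222"

old_value = NODE_A                  # only the node of the first-up brick
new_value = NODE_A + " " + NODE_B   # every node of the subvolume

# One worker per node evaluates the same check against its own uuid.
old_roles = [is_active(old_value, n) for n in (NODE_A, NODE_B)]
new_roles = [is_active(new_value, n) for n in (NODE_A, NODE_B)]
```

                                                                With old_value only the worker on NODE_A is Active; with new_value both are.<br>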
<br>
<br>
                                                        How was this case solved?<br>
<br>
                                                        Suppose we have three servers and 2 bricks in each server. A replicated volume is created using the following command:<br>
<br>
                                                        gluster volume create test replica 2 server1:/brick1 server2:/brick1 server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2<br>
<br>
                                                        In this case we have three replica-sets:<br>
<br>
                                                        * server1:/brick1 server2:/brick1<br>
                                                        * server2:/brick2 server3:/brick1<br>
                                                        * server3:/brick2 server1:/brick2<br>
<br>
                                                        The old AFR implementation for node-uuid always returned the uuid of the node of the first brick, so in this case we will get the uuids of all three nodes, because each of them hosts the first brick of a replica-set.<br>
<br>
                                                        Does this mean that with this configuration all nodes are active? Is this a problem? Is there any other check to avoid this situation if it&#39;s not good?<br>
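<br>
                                                        To make the case concrete, here is a small sketch that groups the brick list into replica-sets and collects the server of each first brick. It assumes the last pair of the volume is server3:/brick2 and server1:/brick2, uses server names in place of node uuids, and the helper names are hypothetical:<br>

```python
def replica_sets(bricks, replica):
    """Group consecutive bricks into replica-sets of the given size."""
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def first_brick_servers(bricks, replica):
    """Servers hosting the first brick of each replica-set; these are the
    nodes the old node-uuid behaviour would report."""
    return {group[0].split(":", 1)[0] for group in replica_sets(bricks, replica)}


# Brick order as passed to "gluster volume create test replica 2 ...".
BRICKS = [
    "server1:/brick1", "server2:/brick1",
    "server2:/brick2", "server3:/brick1",
    "server3:/brick2", "server1:/brick2",
]

reported = first_brick_servers(BRICKS, 2)
```

                                                        Here reported contains all three servers, so every node would see its own uuid in the value.<br>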
<br>
                                                    Yes, all Geo-rep workers will become Active and participate in syncing. Since changelogs will have the same information in replica bricks, this will lead to duplicate syncing and wasted network bandwidth.<br>
<br>
                                                    Node-uuid based Active worker selection has been the default configuration in Geo-rep till now. Geo-rep also has Meta Volume based synchronization for Active workers using lock files. (It can be opted into using Geo-rep configuration; with this config node-uuid will not be used.)<br>
<br>
                                                    Kotresh proposed a solution to configure which worker becomes Active. This will give more control to the Admin to choose Active workers. This will become the default configuration from 3.12.<br>
<br>
                    <a href="https://github.com/gluster/glusterfs/issues/244" rel="noreferrer" target="_blank">https://github.com/gluster/glusterfs/issues/244</a><br>
<br>
                                                    --<br>
                                                    Aravinda<br>
<br>
<br>
<br>
                                                        Xavi<br>
<br>
<br>
<br>
<br>
<br>
                                                                    Bricks:<br>
<br>
                                                                    &lt;guid&gt;<br>
<br>
                                                                    AFR/EC:<br>
<br>
                                                                    &lt;guid&gt;(&lt;guid&gt;, &lt;guid&gt;)<br>
<br>
                                                                    DHT:<br>
<br>
                                                                    &lt;guid&gt;(&lt;guid&gt;(&lt;guid&gt;, ...), &lt;guid&gt;(&lt;guid&gt;, ...))<br>
<br>
                                                                    In this case, AFR and EC would return the same &lt;guid&gt; they returned before the patch, but between &#39;(&#39; and &#39;)&#39; they put the full list of guids of all nodes. The first &lt;guid&gt; can be used by geo-replication. The list after the first &lt;guid&gt; can be used for rebalance.<br>
<br>
                                                                    Not sure if there&#39;s any user of node-uuid above DHT.<br>
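<br>
                                                                    A rough sketch of how the two consumers could read this backward-compatible form. The token pattern (hex digits, letters and dashes) and the function names are assumptions for illustration; the format itself is only a proposal in this thread:<br>

```python
import re

# A guid that is directly followed by "," or ")" is a leaf (a node);
# a guid followed by "(" is the compatibility value of an inner xlator.
_LEAF = re.compile(r"([0-9A-Za-z-]+)\s*(?=[,)])")


def first_guid(value):
    """The leading guid: what geo-replication would keep using."""
    return value.partition("(")[0].strip()


def member_guids(value):
    """All node guids inside the parentheses: what rebalance could use.
    A bare guid (single brick) is its own only member."""
    if "(" not in value:
        return [value.strip()]
    return _LEAF.findall(value)
```

                                                                    For example, first_guid("g1(g1, g2)") gives "g1", while member_guids on the same value gives both "g1" and "g2".<br>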
<br>
                                                                    Xavi<br>
<br>
<br>
<br>
<br>
<br>
                        Xavi<br>
<br>
<br>
<br>
                            On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez &lt;<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>&gt; wrote:<br>
<br>
<br>
                                Hi Pranith,<br>
<br>
                                On 20/06/17 07:53, Pranith Kumar Karampuri wrote:<br>
<br>
<br>
                                                            hi Xavi,<br>
<br>
                                                            We all made the mistake of not sending a note about changing the behavior of the node-uuid xattr so that rebalance can use multiple nodes for doing rebalance. Because of this, on geo-rep all the workers are becoming active instead of one per EC/AFR subvolume. So we are frantically trying to restore the functionality of node-uuid and introduce a new xattr for</blockquote>
</blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr">Pranith<br></div></div>
</div></div>