[Gluster-devel] geo-rep regression because of node-uuid change

Xavier Hernandez xhernandez at datalab.es
Fri Jul 7 09:16:37 UTC 2017


On 07/07/17 10:12, Pranith Kumar Karampuri wrote:
>
>
> On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
>     Hi Pranith,
>
>     On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
>
>
>
>         On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
>         <xhernandez at datalab.es> wrote:
>
>             Hi Pranith,
>
>             On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>
>                 Xavi,
>                 Now that the change has been reverted, we can resume
>                 this discussion and decide on the exact format that
>                 considers tier, dht, afr and ec. People working on
>                 geo-rep/dht/afr/ec had an internal discussion and we
>                 all agreed that this proposal would be a good way
>                 forward. I think once we agree on the format, decide
>                 on the initial encoding/decoding functions of the
>                 xattr, and merge that change, we can send patches for
>                 afr/ec/dht and geo-rep to take it to closure.
>
>                 Could you propose the new format you have in mind
>                 that considers all of the xlators?
>
>
>             My idea was to create a new xattr not bound to any
>             particular function, but which could give enough
>             information to be used in many places.
>
>             Currently we have another attribute, glusterfs.pathinfo,
>             that returns hierarchical information about the location
>             of a file. Maybe we can extend this to unify all these
>             attributes into a single feature that could be used for
>             multiple purposes.
>
>             Since we have time to discuss it, I would like to design
>             it with more information than we have already talked
>             about.
>
>             First of all, the amount of information that this
>             attribute can contain is quite big if we expect to have
>             volumes with thousands of bricks. Even in the simplest
>             case of returning only a UUID, we can easily go beyond
>             the 64KB limit.
>
>             Consider also, for example, what shard should return when
>             pathinfo is requested for a file. Probably it should
>             return a list of shards, each one with all its associated
>             pathinfo. We are talking about big amounts of data here.
>
>             I think this kind of information doesn't fit very well in
>             an extended attribute. Another thing to consider is that
>             most probably the requester only needs a fragment of the
>             data, so we are generating big amounts of data only to be
>             parsed and reduced later, dismissing most of it.
>
>             What do you think about using a very special virtual file
>             to manage all this information? It could be easily read
>             using normal read fops, so it could handle big amounts of
>             data. Also, by accessing only some parts of the file we
>             could go directly where we want, avoiding the read of all
>             the remaining data.
>
>             A very basic idea could be this:
>
>             Each xlator would have a reserved area of the file. We
>             can reserve up to 4GB per xlator (32 bits). The remaining
>             32 bits of the offset would indicate the xlator we want
>             to access.
>
>             At offset 0 we have generic information about the volume.
>             One of the things this information should include is a
>             basic hierarchy of the whole volume and the offset of
>             each xlator.
>
>             After reading this, the user would seek to the desired
>             offset and read the information related to the xlator it
>             is interested in.
>
>             All the information should be stored in an easily
>             extensible format that will be kept compatible even if
>             new information is added in the future (for example by
>             doing special mappings of the 32-bit offsets reserved for
>             the xlator).
>
>             For example, we could reserve the first megabyte of the
>             xlator area for a mapping of attributes to their
>             respective offsets.
>
>             I think that using a binary format would simplify all
>             this a lot.
>
>             Do you think this is a way worth exploring, or should I
>             stop wasting time here?
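To make the 32/32 offset split concrete, here is a hypothetical sketch (Python, illustration only; none of these names are existing Gluster code):

```python
# Hypothetical sketch of the proposed addressing: the upper 32 bits of a
# 64-bit file offset select the xlator area, the lower 32 bits address
# up to 4GB of data inside that area.

XLATOR_BITS = 32
INNER_MASK = (1 << XLATOR_BITS) - 1

def xlator_offset(xlator_id, inner_offset):
    """Build the file offset for a given xlator area and inner offset."""
    assert 0 <= inner_offset <= INNER_MASK
    return (xlator_id << XLATOR_BITS) | inner_offset

def split_offset(offset):
    """Recover (xlator_id, inner_offset) from a file offset."""
    return offset >> XLATOR_BITS, offset & INNER_MASK
```

A reader would first read the hierarchy at offset 0, find the id of the xlator it is interested in, then seek to xlator_offset(id, 0) and read only that area.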
>
>
>         I think this just became a very big feature :-). Shall we
>         just live with it the way it is now?
>
>
>     I supposed so...
>
>     The only thing we need to check is whether shard needs to handle
>     this xattr. If so, what should it return? Only the UUIDs
>     corresponding to the first shard, or the UUIDs of all bricks
>     containing at least one shard? I guess the first one is enough,
>     but just to be sure...
>
>     My proposal was to implement a new xattr, for example
>     glusterfs.layout, that contains enough information to be usable
>     in all current use cases.
>
>
> Actually pathinfo is supposed to give this information, and it already
> has the following format for a 5x2 distributed-replicate volume:

Yes, I know. I wanted to unify all information.

>
> root at dhcp35-190 - /mnt/v3
> 13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
> # file: d
> trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
> (<REPLICATE:v3-replicate-0>
> <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
> <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
> (<REPLICATE:v3-replicate-2>
> <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
> <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
> (<REPLICATE:v3-replicate-1>
> <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
> <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
> (<REPLICATE:v3-replicate-4>
> <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
> <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
> (<REPLICATE:v3-replicate-3>
> <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
> <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
> (v3-dht-layout (v3-replicate-0 0 858993458) (v3-replicate-1 858993459
> 1717986917) (v3-replicate-2 1717986918 2576980376) (v3-replicate-3
> 2576980377 3435973835) (v3-replicate-4 3435973836 4294967295)))"
>
>
> root at dhcp35-190 - /mnt/v3
> 13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
> # file: d/a
> trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
> (<REPLICATE:v3-replicate-1>
> <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
> <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
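For anyone wanting to consume pathinfo programmatically, the nested parentheses are straightforward to tokenize. A rough sketch (hypothetical helper, not part of Gluster):

```python
# Parse a trusted.glusterfs.pathinfo value into a nested list that
# mirrors the "(...)" grouping; leaves are the "<...>" xlator/brick
# tags. Bare words (e.g. the dht layout section) are skipped for
# simplicity -- this is only a sketch of the structure shown above.

def parse_pathinfo(s):
    pos = 0

    def parse():
        nonlocal pos
        node = []
        while pos < len(s):
            c = s[pos]
            if c == '(':            # open a nested group
                pos += 1
                node.append(parse())
            elif c == ')':          # close the current group
                pos += 1
                return node
            elif c == '<':          # a <...> tag is one leaf token
                end = s.index('>', pos)
                node.append(s[pos:end + 1])
                pos = end + 1
            else:                   # whitespace and other text
                pos += 1
        return node

    return parse()
```

Note that parentheses inside a `<POSIX(...)>` tag are consumed as part of the tag, so they do not confuse the grouping.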
>
>
>
>
>     The idea would be that each xlator that makes a significant
>     change in the way or the place where files are stored should put
>     information in this xattr. The information should include:
>
>     * Type (basically AFR, EC, DHT, ...)
>     * Basic configuration (replication and arbiter for AFR, data and
>       redundancy for EC, # subvolumes for DHT, shard size for
>       sharding, ...)
>     * Quorum imposed by the xlator
>     * UUID data coming from subvolumes (sorted by brick position)
>     * It should be easily extensible in the future
>
>     The last point is very important to avoid the issues we have
>     seen now. We must be able to incorporate more information
>     without breaking backward compatibility. To do so, we can add
>     tags for each value.
>
>     For example, a distribute 2, replica 2 volume with 1 arbiter
>     would be represented by this string:
>
>        DHT[dist=2,quorum=1](
>           AFR[rep=2,arbiter=1,quorum=2](
>              NODE[quorum=2,uuid=<UUID1>](<path1>),
>              NODE[quorum=2,uuid=<UUID2>](<path2>),
>              NODE[quorum=2,uuid=<UUID3>](<path3>)
>           ),
>           AFR[rep=2,arbiter=1,quorum=2](
>              NODE[quorum=2,uuid=<UUID4>](<path4>),
>              NODE[quorum=2,uuid=<UUID5>](<path5>),
>              NODE[quorum=2,uuid=<UUID6>](<path6>)
>           )
>        )
>
>     Some explanations:
>
>     AFAIK DHT doesn't have quorum, so the default is '1'. We may
>     decide to omit it when it's '1' for any xlator.
>
>     Quorum in AFR represents the client-side enforced quorum. Quorum
>     in NODE represents the server-side enforced quorum.
>
>     The <path> shown in each NODE represents the physical location
>     of the file (similar to the current glusterfs.pathinfo), because
>     this xattr can be retrieved for a particular file using getxattr.
>     This is nice, but we can remove it for now if it's difficult to
>     implement.
>
>     We can decide to have a verbose string or try to omit some
>     fields when not strictly necessary. For example, if there are no
>     arbiters, we can omit the 'arbiter' tag instead of writing
>     'arbiter=0'. We could also implicitly compute 'dist' and 'rep'
>     from the number of elements contained between '()'.
>
>     What do you think?
>
>
> Quite a few people are already familiar with path-info. So I am of
> the opinion that we give this information through that xattr itself.
> This xattr hasn't changed after quorum/arbiter/shard came in, so
> maybe they should?

Not sure how easy it would be to change the format of path-info to 
incorporate the new information without breaking existing features or 
even user scripts based on it. Maybe a new xattr would be easier to 
implement and adapt.

I missed one important thing in the format: an xlator may have 
per-subvolume information. This information can be placed just before 
each subvolume information:

    DHT[dist=2,quorum=1](
       [hash-range=0x00000000-0x7fffffff]AFR[...](...),
       [hash-range=0x80000000-0xffffffff]AFR[...](...)
    )
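A format like this stays machine-parseable. As a rough sketch of a reader for the proposed string (hypothetical code, only to check that the grammar is sane; none of these names exist in Gluster):

```python
# Parse the proposed layout string, e.g.
# "DHT[dist=2,quorum=1](AFR[...](NODE[...](<path>), ...), ...)".
import re

# One node looks like NAME[tag=value,...]( child, child, ... )
TOKEN = re.compile(r'([A-Z]+)\[([^\]]*)\]\(')

def parse_layout(s, pos=0):
    """Parse one node starting at pos; return (node, next_pos)."""
    m = TOKEN.match(s, pos)
    if m is None:                      # leaf payload, e.g. a path
        end = s.index(')', pos)
        return s[pos:end].strip(), end
    name, tags_str = m.group(1), m.group(2)
    tags = dict(t.split('=', 1) for t in tags_str.split(',') if t)
    pos = m.end()
    children = []
    while s[pos] != ')':
        if s[pos] in ', \n\t':         # separators between children
            pos += 1
            continue
        child, pos = parse_layout(s, pos)
        children.append(child)
    return {'type': name, 'tags': tags, 'children': children}, pos + 1
```

Unknown tags are simply kept in the dict, which is what would let old parsers survive new fields being added.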

Xavi

>
>
>
>     Xavi
>
>
>
>
>             Xavi
>
>
>
>
>                 On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
>                 <ksubrahm at redhat.com> wrote:
>
>
>
>                     On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
>                     <xhernandez at datalab.es> wrote:
>
>                         That's ok. I'm currently unable to write a
>                         patch for this on ec.
>
>                     Sunil is working on this patch.
>
>                     ~Karthik
>
>                         If no one can do it, I can try to do it in
>                         6-7 hours...
>
>                         Xavi
>
>
>                         On Wednesday, June 21, 2017 09:48 CEST,
>                         Pranith Kumar Karampuri
>                         <pkarampu at redhat.com> wrote:
>
>
>
>                             On Wed, Jun 21, 2017 at 1:00 PM, Xavier
>                             Hernandez <xhernandez at datalab.es>
>                             wrote:
>
>                                 I'm ok with reverting node-uuid
>                                 content to the previous format and
>                                 creating a new xattr for the new
>                                 format. Currently, only rebalance
>                                 will use it.
>
>                                 The only thing to consider is what
>                                 can happen if we have a half-upgraded
>                                 cluster where some clients have this
>                                 change and some don't. Can rebalance
>                                 work in this situation? If so, could
>                                 there be any issue?
>
>
>                             I think there shouldn't be any problem,
>                             because this is an in-memory xattr, so
>                             layers below afr/ec will only see the
>                             node-uuid xattr. This also gives us a
>                             chance to do whatever we want to do in
>                             the future with this xattr without any
>                             backward compatibility problems.
>
>                             You can check
>                             https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>                             for how Karthik implemented this in AFR
>                             (this got merged accidentally yesterday,
>                             but looks like this is what we are
>                             settling on).
>
>
>
>                                 Xavi
>
>
>                                 On Wednesday, June 21, 2017 06:56
>                                 CEST, Pranith Kumar Karampuri
>                                 <pkarampu at redhat.com> wrote:
>
>
>
>                                     On Wed, Jun 21, 2017 at 10:07 AM,
>                                     Nithya Balachandran
>                                     <nbalacha at redhat.com> wrote:
>
>
>                                         On 20 June 2017 at 20:38,
>                                         Aravinda
>                                         <avishwan at redhat.com>
>                                         wrote:
>
>                                             On 06/20/2017 06:02 PM,
>                                             Pranith Kumar Karampuri
>                                             wrote:
>
>                                                 Xavi, Aravinda and I
>                                                 had a discussion on
>                                                 #gluster-dev and we
>                                                 agreed to go with the
>                                                 format Aravinda
>                                                 suggested for now; in
>                                                 the future we want
>                                                 some more changes for
>                                                 dht to detect which
>                                                 subvolume went down
>                                                 and came back up, and
>                                                 at that time we will
>                                                 revisit the solution
>                                                 suggested by Xavi.
>
>                                                 Susanth is doing the
>                                                 dht changes.
>                                                 Aravinda is doing the
>                                                 geo-rep changes.
>
>                                             Done. Geo-rep patch sent
>                                             for review:
>
>                                             https://review.gluster.org/17582
>
>
>
>                                         The proposed changes to the
>                                         node-uuid behaviour (while
>                                         good) are going to break
>                                         tiering. Tiering changes will
>                                         take a little more time to be
>                                         coded and tested.
>
>                                         As this is a regression for
>                                         3.11 and a blocker for
>                                         3.11.1, I suggest we go back
>                                         to the original node-uuid
>                                         behaviour for now so as to
>                                         unblock the release, and
>                                         target the proposed changes
>                                         for the next 3.11 releases.
>
>
>                                     Let me see if I understand the
>                                     changes correctly. We are
>                                     restoring the behavior of the
>                                     node-uuid xattr and adding a new
>                                     xattr for parallel rebalance for
>                                     both afr and ec, correct?
>                                     Otherwise that is one more
>                                     regression. If yes, we will also
>                                     wait for Xavi's inputs. Jeff
>                                     accidentally merged the afr patch
>                                     yesterday which does these
>                                     changes. If everyone is in
>                                     agreement, we will leave it as is
>                                     and add similar changes in ec as
>                                     well. If we are not in agreement,
>                                     then we will let the discussion
>                                     progress :-)
>
>
>
>
>                                         Regards,
>                                         Nithya
>
>                                             --
>                                             Aravinda
>
>
>                                                 Thanks to all of you
>                                                 guys for the
>                                                 discussions!
>
>                                                 On Tue, Jun 20, 2017
>                                                 at 5:05 PM, Xavier
>                                                 Hernandez
>                                                 <xhernandez at datalab.es>
>                                                 wrote:
>
>                                                     Hi Aravinda,
>
>                                                     On 20/06/17 12:42, Aravinda wrote:
>
>                                                         I think the following format can be
>                                                         easily adopted by all components:
>
>                                                         UUIDs of a subvolume are separated by
>                                                         space and subvolumes are separated by
>                                                         comma.
>
>                                                         For example, node1 and node2 are
>                                                         replicas with UUIDs U1 and U2
>                                                         respectively, and node3 and node4 are
>                                                         replicas with UUIDs U3 and U4
>                                                         respectively.
>
>                                                         node-uuid can return "U1 U2,U3 U4"
>
>                                                     While this is ok for the current
>                                                     implementation, I think it can be
>                                                     insufficient if there are more layers of
>                                                     xlators that need to indicate some sort
>                                                     of grouping. Some representation that can
>                                                     express hierarchy would be better. For
>                                                     example: "(U1 U2) (U3 U4)" (we can use
>                                                     spaces or comma as a separator).
>
>                                                         Geo-rep can split by "," and then
>                                                         split by space and take the first
>                                                         UUID. DHT can split the value by
>                                                         space or comma and get the unique
>                                                         list of UUIDs.
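The two consumers described above could be sketched like this (hypothetical Python, illustrating only the splitting rules; these helpers are not geo-rep or DHT code):

```python
# "U1 U2,U3 U4" format: UUIDs within a replica set separated by spaces,
# replica sets separated by commas.

def georep_is_active(node_uuid_value, own_uuid):
    """A geo-rep worker becomes active iff its own UUID is the first
    UUID of some replica set (so only one worker per set is active)."""
    firsts = [group.split()[0] for group in node_uuid_value.split(',')]
    return own_uuid in firsts

def dht_unique_uuids(node_uuid_value):
    """DHT only needs the de-duplicated list of all UUIDs, in order."""
    seen = []
    for group in node_uuid_value.split(','):
        for uuid in group.split():
            if uuid not in seen:
                seen.append(uuid)
    return seen
```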
>
>
>                                                     This doesn't solve the problem I described
>                                                     in the previous email. Some more logic
>                                                     will need to be added to prevent more than
>                                                     one node from each replica-set being
>                                                     active. If we have some explicit hierarchy
>                                                     information in the node-uuid value, more
>                                                     decisions can be taken.
>
>                                                     An initial proposal I made was this:
>
>                                                     DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
>                                                            AFR[2,0](NODE(U1), NODE(U2)))
>
>                                                     This is harder to parse, but gives a lot
>                                                     of information: DHT with 2 subvolumes,
>                                                     each subvolume is an AFR with replica 2
>                                                     and no arbiters. It's also easily
>                                                     extensible with any new xlator that
>                                                     changes the layout.
>
>                                                     However maybe this is not the moment to do
>                                                     this, and probably we could implement this
>                                                     in a new xattr with a better name.
>
>                                                     Xavi
>
>
>
>                                                         Another question is
>                             about the behavior
>                                                         when a node is down,
>                             existing
>                                                         node-uuid xattr
>         will not
>                             return that
>                                                         UUID if a node
>         is down.
>                             What is the
>                                                         behavior with the
>                             proposed xattr?
>
>                                                         Let me know your
>         thoughts.
>
>                                                         regards
>                                                         Aravinda VK
>
>                                                         On 06/20/2017
>         03:06 PM,
>                             Aravinda wrote:
>
>                                                             Hi Xavi,
>
>                                                             On
>         06/20/2017 02:51
>                             PM, Xavier
>                                                             Hernandez wrote:
>
>                                                                 Hi Aravinda,
>
>                                                                 On 20/06/17
>                             11:05, Pranith Kumar
>
>         Karampuri wrote:
>
>
>         Adding more
>                             people to get a
>
>         consensus
>                             about this.
>
>                                                                     On
>         Tue, Jun
>                             20, 2017 at 1:49
>                                                                     PM,
>         Aravinda
>
>                             <avishwan at redhat.com
>         <mailto:avishwan at redhat.com> <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>>
>
>                             <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>
>                             <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>>>
>
>                             <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com> <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>>
>
>                             <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>
>                             <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>>>>>
>                                                                     wrote:
>
>
>
>         regards
>
>         Aravinda VK
>
>
>                                                                         On
>                             06/20/2017 01:26 PM,
>                                                                     Xavier
>                             Hernandez wrote:
>
>
>             Hi
>                             Pranith,
>
>
>             adding
>
>                             gluster-devel, Kotresh and
>
>         Aravinda,
>
>
>             On
>                             20/06/17 09:45,
>                                                                     Pranith
>                             Kumar Karampuri wrote:
>
>
>
>
>                             On Tue, Jun 20,
>                             2017 at 1:12 PM, Xavier Hernandez
>                             <xhernandez at datalab.es> wrote:
>
>                                 On 20/06/17 09:31, Pranith Kumar
>                                 Karampuri wrote:
>
>
>                                     The way geo-replication works is:
>                                     on each machine, it does a
>                                     getxattr of node-uuid and checks
>                                     whether its own uuid is present
>                                     in the list. If it is present,
>                                     the worker is considered active;
>                                     otherwise it is considered
>                                     passive. With this change we are
>                                     giving all uuids instead of only
>                                     the first-up subvolume's, so all
>                                     machines think they are ACTIVE,
>                                     which is bad apparently. So that
>                                     is the reason. Even I felt bad
>                                     that we are doing this change.
>
>
>                                 And what about changing the content
>                                 of node-uuid to include some sort of
>                                 hierarchy?
>
>                                 For example, a single brick:
>
>                                 NODE(<guid>)
>
>                                 AFR/EC:
>
>                                 AFR[2](NODE(<guid>), NODE(<guid>))
>                                 EC[3,1](NODE(<guid>), NODE(<guid>),
>                                 NODE(<guid>))
>
>                                 DHT:
>
>                                 DHT[2](AFR[2](NODE(<guid>),
>                                 NODE(<guid>)), AFR[2](NODE(<guid>),
>                                 NODE(<guid>)))
>
>                                 This gives a lot of information that
>                                 can be used to take the appropriate
>                                 decisions.
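[Editor's note] Purely as an illustration of the hierarchical format proposed above, here is a minimal sketch of how a consumer could parse such a value and recover the pre-patch "first node of each subvolume" behaviour. The exact format, tree representation, and function names are this editor's assumptions, not GlusterFS code.

```python
# Illustrative only: parse strings shaped like the proposal above,
# e.g. "DHT[2](AFR[2](NODE(a), NODE(b)), AFR[2](NODE(c), NODE(d)))".
# Format and helper names are assumptions, not GlusterFS code.

def parse(expr):
    """Parse one node expression; return (tree, unconsumed-suffix)."""
    expr = expr.lstrip(', ')
    if expr.startswith('NODE('):                 # leaf: NODE(<guid>)
        end = expr.index(')')
        return {'type': 'NODE', 'guid': expr[5:end]}, expr[end + 1:]
    bracket = expr.index('[')                    # XLATOR[counts](...)
    paren = expr.index('(')
    node = {'type': expr[:bracket],
            'counts': expr[bracket + 1:paren - 1],
            'children': []}
    rest = expr[paren + 1:]
    while not rest.lstrip(', ').startswith(')'):
        child, rest = parse(rest)
        node['children'].append(child)
    return node, rest.lstrip(', ')[1:]

def first_uuids(tree):
    """Guids as the pre-patch node-uuid returned them: only the
    first node of each AFR/EC subvolume, all subvolumes under DHT."""
    if tree['type'] == 'NODE':
        return [tree['guid']]
    if tree['type'] in ('AFR', 'EC'):
        return first_uuids(tree['children'][0])
    return [g for child in tree['children'] for g in first_uuids(child)]
```

A geo-rep-style consumer could then check its own uuid against `first_uuids(tree)`, while rebalance could walk the full tree.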
>
>
>
>                             I guess that is not backward compatible.
>                             Shall I CC gluster-devel and
>                             Kotresh/Aravinda?
>
>
>
>             Is the change we did backward compatible? If we only
>             require the first field to be a GUID to support backward
>             compatibility, we can use something like this:
>
>
>         No. But the necessary change can be made to the Geo-rep code
>         as well if the format is changed, since all of these are
>         built/shipped together.
>
>         Geo-rep uses node-uuid as follows:
>
>         list = listxattr(node-uuid)
>         active_node_uuids = list.split(SPACE)
>         active_node_flag = True if self.node_id exists in
>         active_node_uuids else False
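[Editor's note] The pseudocode above can be restated as a tiny runnable sketch. This is a simplification, not the actual Geo-rep source; the node-uuid xattr value is assumed to be a whitespace-separated list of uuids.

```python
# Simplified sketch of the Active/Passive decision described above;
# not the actual Geo-rep source. The node-uuid xattr value is assumed
# to be a whitespace-separated list of uuids.

def is_active_worker(node_uuid_xattr: str, own_node_id: str) -> bool:
    """A worker becomes Active iff its node's uuid is in the list."""
    active_node_uuids = node_uuid_xattr.split()
    return own_node_id in active_node_uuids

# Old behaviour: the xattr carried only each first-up subvolume's
# uuid, so one worker per replica went Active. Returning every uuid
# makes every worker's check succeed, hence all workers going Active.
```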
>
>
>                                                 How was this case
>                                                 solved?
>
>                                                 Suppose we have three
>                                                 servers and two
>                                                 bricks on each
>                                                 server. A replicated
>                                                 volume is created
>                                                 using the following
>                                                 command:
>
>                                                 gluster volume create
>                                                 test replica 2
>                                                 server1:/brick1
>                                                 server2:/brick1
>                                                 server2:/brick2
>                                                 server3:/brick1
>                                                 server3:/brick2
>                                                 server1:/brick2
>
>                                                 In this case we have
>                                                 three replica-sets:
>
>                                                 * server1:/brick1
>                                                   server2:/brick1
>                                                 * server2:/brick2
>                                                   server3:/brick1
>                                                 * server3:/brick2
>                                                   server1:/brick2
>
>                                                 The old AFR
>                                                 implementation of
>                                                 node-uuid always
>                                                 returned the uuid of
>                                                 the node of the first
>                                                 brick, so in this
>                                                 case we will get the
>                                                 uuids of all three
>                                                 nodes, because each
>                                                 node hosts the first
>                                                 brick of some
>                                                 replica-set.
>
>                                                 Does this mean that
>                                                 with this
>                                                 configuration all
>                                                 nodes are active? Is
>                                                 this a problem? Is
>                                                 there any other check
>                                                 to avoid this
>                                                 situation if it's not
>                                                 good?
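[Editor's note] A small sketch making the effect concrete (server/brick names taken from the example above; the helper logic is hypothetical): in this chained layout, collecting only the first brick of each replica-set still touches all three nodes.

```python
# Sketch: with the chained layout above, the set of nodes hosting the
# *first* brick of each replica-set (what old AFR reported in
# node-uuid) already covers all three servers.

bricks = [
    "server1:/brick1", "server2:/brick1",   # replica-set 1
    "server2:/brick2", "server3:/brick1",   # replica-set 2
    "server3:/brick2", "server1:/brick2",   # replica-set 3
]
replica_count = 2

# One "first brick" per replica-set, i.e. every other entry.
first_bricks = bricks[::replica_count]
first_nodes = sorted({b.split(":")[0] for b in first_bricks})
print(first_nodes)   # every server appears
```

So every node's geo-rep worker finds its own uuid in the returned list and goes Active, even under the old behaviour.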
>
>                                             Yes, all Geo-rep workers
>                                             will become Active and
>                                             participate in syncing.
>                                             Since changelogs will
>                                             have the same information
>                                             in replica bricks, this
>                                             will lead to duplicate
>                                             syncing and consume
>                                             network bandwidth.
>
>                                             Node-uuid based Active
>                                             worker selection has been
>                                             the default configuration
>                                             in Geo-rep till now.
>                                             Geo-rep also has Meta
>                                             Volume based
>                                             synchronization for
>                                             Active workers using lock
>                                             files (it can be opted
>                                             into via the Geo-rep
>                                             configuration; with this
>                                             config node-uuid will not
>                                             be used).
>
>                                             Kotresh proposed a
>                                             solution to configure
>                                             which worker becomes
>                                             Active. This will give
>                                             the Admin more control to
>                                             choose Active workers,
>                                             and will become the
>                                             default configuration
>                                             from 3.12:
>
>         https://github.com/gluster/glusterfs/issues/244
>
>                                             --
>                                             Aravinda
>
>
>
>                                                                 Xavi
>
>
>
>
>
>
>             Bricks:
>
>             <guid>
>
>             AFR/EC:
>
>             <guid>(<guid>, <guid>)
>
>             DHT:
>
>             <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>
>             In this case, AFR and EC would return the same <guid>
>             they returned before the patch, but between '(' and ')'
>             they would put the full list of guids of all nodes. The
>             first <guid> can be used by geo-replication. The list
>             after the first <guid> can be used for rebalance.
>
>             Not sure if there's any user of node-uuid above DHT.
>
>             Xavi
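[Editor's note] A minimal sketch of how this backward-compatible value could be read (editor-added; the guid token pattern and function name are assumptions, not the real GlusterFS uuid grammar): old consumers keep using the first token, newer ones take the full list.

```python
import re

# Loose guid token pattern (hex digits and dashes); an assumption for
# illustration, not the real GlusterFS uuid grammar.
GUID = re.compile(r'[0-9a-fA-F-]+')

def split_compat(value: str):
    """Return (first_guid, all_guids) for 'g1', 'g1(g2, g3)' or
    nested forms like 'g0(g1(g2, ...), g3(g4, ...))'."""
    guids = GUID.findall(value)
    return guids[0], guids

# Geo-replication keeps its old check against the first guid, while
# rebalance can spread work across all of them.
```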
>
>
>
>
>
>                                 Xavi
>
>
>
>                             On Tue, Jun 20, 2017 at 12:46 PM, Xavier
>                             Hernandez <xhernandez at datalab.es>
>                             wrote:
>
>
>                                     Hi Pranith,
>
>                                     On 20/06/17 07:53, Pranith Kumar
>                                     Karampuri wrote:
>
>                                         hi Xavi,
>                                               We all made the mistake
>                                         of not sending anything about
>                                         changing the behavior of the
>                                         node-uuid xattr so that
>                                         rebalance can use multiple
>                                         nodes for doing rebalance.
>                                         Because of this, on geo-rep
>                                         all the workers are becoming
>                                         active instead of one per
>                                         EC/AFR subvolume. So we are
>                                         frantically trying to restore
>                                         the functionality of
>                                         node-uuid and introduce a new
>                                         xattr for
>
>
> --
> Pranith


