[Gluster-devel] geo-rep regression because of node-uuid change

Pranith Kumar Karampuri pkarampu at redhat.com
Fri Jul 7 08:12:47 UTC 2017


On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> Hi Pranith,
>
> On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>
>>     Hi Pranith,
>>
>>     On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>>
>>         Xavi,
>>               Now that the change has been reverted, we can resume this
>>         discussion and decide on the exact format that considers tier, dht,
>>         afr and ec. People working on geo-rep/dht/afr/ec had an internal
>>         discussion and we all agreed that this proposal would be a good way
>>         forward. I think once we agree on the format and decide on the
>>         initial encoding/decoding functions of the xattr and this change is
>>         merged, we can send patches on afr/ec/dht and geo-rep to take it to
>>         closure.
>>
>>         Could you propose the new format you have in mind that considers all
>>         of the xlators?
>>
>>
>>     My idea was to create a new xattr not bound to any particular
>>     function but which could give enough information to be used in many
>>     places.
>>
>>     Currently we have another attribute called glusterfs.pathinfo that
>>     returns hierarchical information about the location of a file. Maybe
>>     we can extend this to unify all these attributes into a single
>>     feature that could be used for multiple purposes.
>>
>>     Since we have time to discuss it, I would like to design it with more
>>     information than we have already talked about.
>>
>>     First of all, the amount of information that this attribute can contain
>>     is quite big if we expect to have volumes with thousands of bricks. Even
>>     in the simplest case of returning only a UUID, we can easily go beyond
>>     the limit of 64KB.
>>
>>     Consider also, for example, what shard should return when pathinfo
>>     is requested for a file. Probably it should return a list of shards,
>>     each one with all its associated pathinfo. We are talking about big
>>     amounts of data here.
>>
>>     I think this kind of information doesn't fit very well in an extended
>>     attribute. Another thing to consider is that most probably the requester
>>     of the data only needs a fragment of it, so we would be generating big
>>     amounts of data only for it to be parsed and reduced later, discarding
>>     most of it.
>>
>>     What do you think about using a very special virtual file to manage all
>>     this information? It could be easily read using normal read fops, so it
>>     could manage big amounts of data easily. Also, by accessing only some
>>     parts of the file we could go directly where we want, avoiding the read
>>     of all the remaining data.
>>
>>     A very basic idea could be this:
>>
>>     Each xlator would have a reserved area of the file. We can reserve
>>     up to 4GB per xlator (32 bits). The remaining 32 bits of the offset
>>     would indicate the xlator we want to access.
>>
>>     At offset 0 we have generic information about the volume. One of the
>>     things that this information should include is a basic hierarchy of the
>>     whole volume and the offset for each xlator.
>>
>>     After reading this, the user will seek to the desired offset and
>>     read the information related to the xlator it is interested in.
>>
>>     All the information should be stored in a format that is easily
>>     extensible and will be kept compatible even if new information is added
>>     in the future (for example by doing special mappings of the 32-bit
>>     offsets reserved for the xlator).
>>
>>     For example, we can reserve the first megabyte of the xlator area to
>>     hold a mapping of attributes with their respective offsets.
>>
>>     I think that using a binary format would simplify all this a lot.
>>
>>     Do you think this is worth exploring, or should I stop wasting time
>>     here?
>>
>>
>> I think this just became a very big feature :-). Shall we just live with
>> it the way it is now?
>>
>
> I supposed so...
>
> The only thing we need to check is whether shard needs to handle this xattr.
> If so, what should it return? Only the UUIDs corresponding to the first shard,
> or the UUIDs of all bricks containing at least one shard? I guess that the
> first one is enough, but just to be sure...
>
> My proposal was to implement a new xattr, for example glusterfs.layout,
> that contains enough information to be usable in all current use cases.
>

Actually, pathinfo is supposed to give this information and it already has the
following format (shown here for a 5x2 distributed-replicate volume):

root at dhcp35-190 - /mnt/v3
13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
# file: d
trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
(<REPLICATE:v3-replicate-0>
<POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
<POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
(<REPLICATE:v3-replicate-2>
<POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
<POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
(<REPLICATE:v3-replicate-1>
<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
(<REPLICATE:v3-replicate-4>
<POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
<POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
(<REPLICATE:v3-replicate-3>
<POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
<POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
(v3-dht-layout (v3-replicate-0 0 858993458) (v3-replicate-1 858993459
1717986917) (v3-replicate-2 1717986918 2576980376) (v3-replicate-3
2576980377 3435973835) (v3-replicate-4 3435973836 4294967295)))"


root at dhcp35-190 - /mnt/v3
13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
# file: d/a
trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
(<REPLICATE:v3-replicate-1>
<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
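
For anyone scripting against this, fetching the xattr from a client mount and
pulling out the brick locations can look roughly like the sketch below. This is
only an illustration: the helper name is made up, and the regex only extracts
the <POSIX(...):host:path> entries shown above, not the nested layout part.

    import os
    import re

    # Illustrative helper (name is made up): extract the POSIX brick entries
    # from trusted.glusterfs.pathinfo for a file on a GlusterFS mount.
    def brick_locations(path):
        raw = os.getxattr(path, "trusted.glusterfs.pathinfo").decode()
        # Entries look like <POSIX(/home/gfs/v3_3):host:/home/gfs/v3_3/d/a>
        return re.findall(r"<POSIX\(([^)]*)\):([^:>]+):([^>]+)>", raw)

    # for brick_root, host, file_path in brick_locations("/mnt/v3/d/a"):
    #     print(host, file_path)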



>
> The idea would be that each xlator that makes a significant change in the
> way or the place where files are stored, should put information in this
> xattr. The information should include:
>
> * Type (basically AFR, EC, DHT, ...)
> * Basic configuration (replication and arbiter for AFR, data and
> redundancy for EC, # subvolumes for DHT, shard size for sharding, ...)
> * Quorum imposed by the xlator
> * UUID data coming from subvolumes (sorted by brick position)
> * It should be easily extensible in the future
>
> The last point is very important to avoid the issues we have seen now. We
> must be able to incorporate more information without breaking backward
> compatibility. To do so, we can add tags for each value.
>
> For example, a distribute 2, replica 2 volume with 1 arbiter should be
> represented by this string:
>
>    DHT[dist=2,quorum=1](
>       AFR[rep=2,arbiter=1,quorum=2](
>          NODE[quorum=2,uuid=<UUID1>](<path1>),
>          NODE[quorum=2,uuid=<UUID2>](<path2>),
>          NODE[quorum=2,uuid=<UUID3>](<path3>)
>       ),
>       AFR[rep=2,arbiter=1,quorum=2](
>          NODE[quorum=2,uuid=<UUID4>](<path4>),
>          NODE[quorum=2,uuid=<UUID5>](<path5>),
>          NODE[quorum=2,uuid=<UUID6>](<path6>)
>       )
>    )
>
> Some explanations:
>
> AFAIK DHT doesn't have quorum, so the default is '1'. We may decide to
> omit it when it's '1' for any xlator.
>
> Quorum in AFR represents client-side enforced quorum. Quorum in NODE
> represents the server-side enforced quorum.
>
> The <path> shown in each NODE represents the physical location of the file
> (similar to current glusterfs.pathinfo) because this xattr can be retrieved
> for a particular file using getxattr. This is nice, but we can remove it
> for now if it's difficult to implement.
>
> We can decide to have a verbose string or try to omit some fields when not
> strictly necessary. For example, if there are no arbiters, we can omit the
> 'arbiter' tag instead of writing 'arbiter=0'. We could also implicitly
> compute 'dist' and 'rep' from the number of elements contained between '()'.
>
> What do you think ?
>

Quite a few people are already familiar with pathinfo, so I am of the opinion
that we provide this information through that xattr itself. This xattr hasn't
changed after quorum/arbiter/shard came in, so maybe they should?
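
Just to make the comparison concrete, a tagged string like the one Xavi
proposes above could be consumed with a small recursive parser along these
lines. This is a sketch only, written against the example format; none of
these names are an agreed interface.

    import re

    # Matches one "XLATOR[key=val,...](" header, e.g. "AFR[rep=2,arbiter=1,quorum=2](".
    _HDR = re.compile(r"\s*([A-Z]+)\[([^\]]*)\]\(")

    def _parse(s, pos):
        m = _HDR.match(s, pos)
        if not m:
            # Leaf content such as a brick path: read up to ',' or ')'.
            end = pos
            while end < len(s) and s[end] not in ",)":
                end += 1
            return {"value": s[pos:end].strip()}, end
        attrs = dict(kv.split("=", 1) for kv in m.group(2).split(",") if "=" in kv)
        node = {"type": m.group(1), "attrs": attrs, "children": []}
        pos = m.end()
        while True:
            child, pos = _parse(s, pos)
            node["children"].append(child)
            while pos < len(s) and s[pos].isspace():
                pos += 1
            if pos < len(s) and s[pos] == ",":
                pos += 1
                continue
            if pos < len(s) and s[pos] == ")":
                pos += 1
            return node, pos

    def parse_layout(s):
        node, _ = _parse(s, 0)
        return node

    # layout = parse_layout("DHT[dist=2,quorum=1]("
    #                       "AFR[rep=2,arbiter=1,quorum=2]("
    #                       "NODE[quorum=2,uuid=U1](path1),"
    #                       "NODE[quorum=2,uuid=U2](path2)))")
    # layout["type"] == "DHT"
    # layout["children"][0]["attrs"]["arbiter"] == "1"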


>
> Xavi
>
>
>>
>>
>>     Xavi
>>
>>
>>         On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
>>         <ksubrahm at redhat.com> wrote:
>>
>>             On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
>>             <xhernandez at datalab.es> wrote:
>>
>>                 That's ok. I'm currently unable to write a patch for this
>>                 on ec.
>>
>>             Sunil is working on this patch.
>>
>>             ~Karthik
>>
>>                 If no one can do it, I can try to do it in 6 - 7 hours...
>>
>>                 Xavi
>>
>>                 On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar
>>                 Karampuri <pkarampu at redhat.com> wrote:
>>
>>                     On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
>>                     <xhernandez at datalab.es> wrote:
>>
>>                         I'm ok with reverting node-uuid content to the
>>                         previous format and creating a new xattr for the new
>>                         format. Currently, only rebalance will use it.
>>
>>                         The only thing to consider is what can happen if we
>>                         have a half-upgraded cluster where some clients have
>>                         this change and some do not. Can rebalance work in
>>                         this situation? If so, could there be any issue?
>>
>>                     I think there shouldn't be any problem, because this is
>>                     an in-memory xattr, so layers below afr/ec will only see
>>                     the node-uuid xattr. This also gives us a chance to do
>>                     whatever we want to do in future with this xattr without
>>                     any problems about backward compatibility.
>>
>>                     You can check
>>                     https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>>                     for how Karthik implemented this in AFR (this got merged
>>                     accidentally yesterday, but it looks like this is what we
>>                     are settling on).
>>
>>
>>                         Xavi
>>
>>
>>                         On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar
>>                         Karampuri <pkarampu at redhat.com> wrote:
>>
>>                             On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran
>>                             <nbalacha at redhat.com> wrote:
>>
>>                                 On 20 June 2017 at 20:38, Aravinda
>>                                 <avishwan at redhat.com> wrote:
>>
>>                                     On 06/20/2017 06:02 PM, Pranith Kumar Karampuri
>>                                     wrote:
>>
>>                                         Xavi, Aravinda and I had a discussion on
>>                                         #gluster-dev and we agreed to go with the
>>                                         format Aravinda suggested for now; in future
>>                                         we want some more changes for dht to detect
>>                                         which subvolume went down and came back up,
>>                                         and at that time we will revisit the solution
>>                                         suggested by Xavi.
>>
>>                                         Susanth is doing the dht changes.
>>                                         Aravinda is doing the geo-rep changes.
>>
>>                                     Done. Geo-rep patch sent for review:
>>                                     https://review.gluster.org/17582
>>
>>                                 The proposed changes to the node-uuid behaviour
>>                                 (while good) are going to break tiering. Tiering
>>                                 changes will take a little more time to be coded and
>>                                 tested.
>>
>>                                 As this is a regression for 3.11 and a blocker for
>>                                 3.11.1, I suggest we go back to the original
>>                                 node-uuid behaviour for now so as to unblock the
>>                                 release, and target the proposed changes for the next
>>                                 3.11 releases.
>>
>>                             Let me see if I understand the changes correctly. We are
>>                             restoring the behavior of the node-uuid xattr and adding
>>                             a new xattr for parallel rebalance for both afr and ec,
>>                             correct? Otherwise that is one more regression. If yes,
>>                             we will also wait for Xavi's inputs. Jeff accidentally
>>                             merged the afr patch yesterday which does these changes.
>>                             If everyone is in agreement, we will leave it as is and
>>                             add similar changes in ec as well. If we are not in
>>                             agreement, then we will let the discussion progress :-)
>>
>>                                 Regards,
>>                                 Nithya
>>
>>                                     --
>>                                     Aravinda
>>
>>                                         Thanks to all of you guys for the discussions!
>>
>>                                         On Tue, Jun 20, 2017 at 5:05 PM, Xavier
>>                                         Hernandez <xhernandez at datalab.es> wrote:
>>
>>                                             Hi Aravinda,
>>
>>                                             On 20/06/17 12:42, Aravinda wrote:
>>
>>                                                 I think the following format can be
>>                                                 easily adopted by all components.
>>
>>                                                 UUIDs of a subvolume are separated by
>>                                                 space and subvolumes are separated by
>>                                                 comma.
>>
>>                                                 For example, node1 and node2 are a
>>                                                 replica with UUIDs U1 and U2
>>                                                 respectively, and node3 and node4 are
>>                                                 a replica with UUIDs U3 and U4
>>                                                 respectively.
>>
>>                                                 node-uuid can return "U1 U2,U3 U4"
>>
>>                                             While this is ok for the current
>>                                             implementation, I think this can be
>>                                             insufficient if there are more layers of
>>                                             xlators that need to indicate some sort of
>>                                             grouping. Some representation that can
>>                                             express hierarchy would be better. For
>>                                             example: "(U1 U2) (U3 U4)" (we can use
>>                                             spaces or commas as a separator).
>>
>>                                                 Geo-rep can split by "," and then split
>>                                                 by space and take the first UUID.
>>                                                 DHT can split the value by space or
>>                                                 comma and get the list of unique UUIDs.
>>
>>                                             This doesn't solve the problem I described
>>                                             in the previous email. Some more logic
>>                                             will need to be added to avoid more than
>>                                             one node from each replica-set becoming
>>                                             active. If we have some explicit hierarchy
>>                                             information in the node-uuid value, more
>>                                             decisions can be taken.
>>
>>                                             An initial proposal I made was this:
>>
>>                                             DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
>>                                             AFR[2,0](NODE(U1), NODE(U2)))
>>
>>                                             This is harder to parse, but gives a lot
>>                                             of information: DHT with 2 subvolumes,
>>                                             each subvolume is an AFR with replica 2
>>                                             and no arbiters. It's also easily
>>                                             extensible with any new xlator that
>>                                             changes the layout.
>>
>>                                             However, maybe this is not the moment to
>>                                             do this, and probably we could implement
>>                                             it in a new xattr with a better name.
>>
>>                                             Xavi
>>
>>                                                 Another question is about the behavior
>>                                                 when a node is down: the existing
>>                                                 node-uuid xattr will not return that
>>                                                 UUID if a node is down. What is the
>>                                                 behavior with the proposed xattr?
>>
>>                                                 Let me know your thoughts.
>>
>>                                                 regards
>>                                                 Aravinda VK
>>
>>                                                 On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>>                                                     Hi Xavi,
>>
>>                                                     On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>
>>                                                         Hi Aravinda,
>>
>>                                                         On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>
>>                                                             Adding more people to get a consensus about this.
>>
>>                                                             On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan at redhat.com> wrote:
>>
>>                                                                 regards
>>                                                                 Aravinda VK
>>
>>                                                                 On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>
>>                                                                     Hi Pranith,
>>
>>                                                                     adding gluster-devel, Kotresh and Aravinda,
>>
>>                                                                     On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>
>>                                                                         On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>
>>                                                                             On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>
>>                                                                                 The way geo-replication works is: on each machine, it does a
>>                                                                                 getxattr of node-uuid and checks if its own uuid is present in the
>>                                                                                 list. If it is present then it will consider itself active,
>>                                                                                 otherwise it will be considered passive. With this change we are
>>                                                                                 giving all uuids instead of the first-up subvolume, so all machines
>>                                                                                 think they are ACTIVE, which is bad apparently. So that is the
>>                                                                                 reason. Even I felt bad that we are doing this change.
>>
>>                                                                             And what about changing the content of node-uuid to include some
>>                                                                             sort of hierarchy? For example:
>>
>>                                                                             a single brick:
>>                                                                             NODE(<guid>)
>>
>>                                                                             AFR/EC:
>>                                                                             AFR[2](NODE(<guid>), NODE(<guid>))
>>                                                                             EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>
>>                                                                             DHT:
>>                                                                             DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))
>>
>>                                                                             This gives a lot of information that can be used to take the
>>                                                                             appropriate decisions.
>>
>>                                                                         I guess that is not backward compatible. Shall I CC gluster-devel
>>                                                                         and Kotresh/Aravinda?
>>
>>                                                                     Is the change we did backward compatible? If we only require the
>>                                                                     first field to be a GUID to support backward compatibility, we can
>>                                                                     use something like this:
>>
>>                                                                 No. But the necessary change can be made to the Geo-rep code as well
>>                                                                 if the format is changed, since all these are built/shipped together.
>>
>>                                                                 Geo-rep uses node-uuid as follows:
>>
>>                                                                 list = listxattr(node-uuid)
>>                                                                 active_node_uuids = list.split(SPACE)
>>                                                                 active_node_flag = True if self.node_id exists in active_node_uuids else False
>>
>>                                                         How was this case solved?
>>
>>                                                         Suppose we have three servers and 2 bricks in each server. A replicated
>>                                                         volume is created using the following command:
>>
>>                                                         gluster volume create test replica 2 server1:/brick1 server2:/brick1
>>                                                         server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2
>>
>>                                                         In this case we have three replica-sets:
>>
>>                                                         * server1:/brick1 server2:/brick1
>>                                                         * server2:/brick2 server3:/brick1
>>                                                         * server3:/brick2 server2:/brick2
>>
>>                                                         The old AFR implementation for node-uuid always returned the uuid of
>>                                                         the node of the first brick, so in this case we will get the uuid of
>>                                                         the three nodes because all of them are the first brick of a
>>                                                         replica-set.
>>
>>                                                         Does this mean that with this configuration all nodes are active? Is
>>                                                         this a problem? Is there any other check to avoid this situation if
>>                                                         it's not good?
>>
>>                                                     Yes, all Geo-rep workers will become Active and participate in syncing.
>>                                                     Since changelogs will have the same information in replica bricks, this
>>                                                     will lead to duplicate syncing and consume network bandwidth.
>>
>>                                                     Node-uuid based Active worker selection is the default configuration in
>>                                                     Geo-rep till now. Geo-rep also has Meta Volume based synchronization for
>>                                                     the Active worker using lock files. (This can be opted for via Geo-rep
>>                                                     configuration; with that config node-uuid will not be used.)
>>
>>                                                     Kotresh proposed a solution to configure which worker becomes Active.
>>                                                     This will give more control to the Admin to choose Active workers, and
>>                                                     will become the default configuration from 3.12:
>>                                                     https://github.com/gluster/glusterfs/issues/244
>>
>>                                                     --
>>                                                     Aravinda
>>
>>                                                         Xavi
>>
>>                                                                     Bricks:
>>                                                                     <guid>
>>
>>                                                                     AFR/EC:
>>                                                                     <guid>(<guid>, <guid>)
>>
>>                                                                     DHT:
>>                                                                     <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>
>>                                                                     In this case, AFR and EC would return the same <guid> they returned
>>                                                                     before the patch, but between '(' and ')' they put the full list of
>>                                                                     guids of all nodes. The first <guid> can be used by geo-replication.
>>                                                                     The list after the first <guid> can be used for rebalance.
>>
>>                                                                     Not sure if there's any user of node-uuid above DHT.
>>
>>                                                                     Xavi
>>
>>                                                             On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
>>                                                             <xhernandez at datalab.es> wrote:
>>
>>                                                                 Hi Pranith,
>>
>>                                                                 On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>>
>>                                                                     hi Xavi,
>>                                                                            We all made the mistake of not sending anything about changing
>>                                                                     the behavior of the node-uuid xattr so that rebalance can use multiple
>>                                                                     nodes for doing rebalance. Because of this, on geo-rep all the workers
>>                                                                     are becoming active instead of one per EC/AFR subvolume. So we are
>>                                                                     frantically trying to restore the functionality of node-uuid and
>>                                                                     introduce a new xattr for
>

-- 
Pranith