[Gluster-devel] geo-rep regression because of node-uuid change
Aravinda
avishwan at redhat.com
Tue Jun 20 15:08:58 UTC 2017
On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
> go with the format Aravinda suggested for now and in future we wanted
> some more changes for dht to detect which subvolume went down came
> back up, at that time we will revisit the solution suggested by Xavi.
>
> Susanth is doing the DHT changes
> Aravinda is doing the Geo-rep changes
Done. Geo-rep patch sent for review https://review.gluster.org/17582
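For reference, here is a minimal sketch (illustrative names only, not the actual Gluster/geo-rep source) of how both consumers can parse the agreed node-uuid format "U1 U2,U3 U4", where subvolumes are separated by comma and UUIDs within a subvolume by space:

```python
# Illustrative sketch (not the actual Gluster/geo-rep source) of how
# both consumers can parse the agreed node-uuid format "U1 U2,U3 U4":
# subvolumes separated by comma, UUIDs within a subvolume by space.

def georep_is_active(value, own_uuid):
    # Geo-rep takes only the first UUID of each subvolume, so exactly
    # one worker per replica-set becomes Active.
    first_uuids = [subvol.split()[0] for subvol in value.split(",")]
    return own_uuid in first_uuids

def dht_unique_uuids(value):
    # DHT wants every unique UUID, in order of first appearance.
    seen = []
    for subvol in value.split(","):
        for uuid in subvol.split():
            if uuid not in seen:
                seen.append(uuid)
    return seen

value = "U1 U2,U3 U4"
print(georep_is_active(value, "U1"))  # True: U1 leads its subvolume
print(georep_is_active(value, "U2"))  # False: U2 is not first
print(dht_unique_uuids(value))        # ['U1', 'U2', 'U3', 'U4']
```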
--
Aravinda
>
> Thanks to all of you guys for the discussions!
>
> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez
> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>
> Hi Aravinda,
>
> On 20/06/17 12:42, Aravinda wrote:
>
> I think following format can be easily adopted by all components
>
> UUIDs of a subvolume are separated by space, and subvolumes are
> separated by comma
>
> For example, node1 and node2 are replica with U1 and U2 UUIDs
> respectively and
> node3 and node4 are replica with U3 and U4 UUIDs respectively
>
> node-uuid can return "U1 U2,U3 U4"
>
>
> While this is ok for the current implementation, I think this can
> be insufficient if there are more layers of xlators that need to
> indicate some sort of grouping. Some representation that can
> represent hierarchy would be better. For example: "(U1 U2) (U3
> U4)" (we can use spaces or comma as a separator).
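The parenthesized grouping needs very little code to consume; a minimal sketch, assuming the exact "(U1 U2) (U3 U4)" form above (the regex and names are illustrative):

```python
import re

# Minimal sketch of parsing the parenthesized grouping suggested
# above, "(U1 U2) (U3 U4)"; the regex and names are illustrative.
def parse_groups(value):
    return [group.split() for group in re.findall(r"\(([^)]*)\)", value)]

print(parse_groups("(U1 U2) (U3 U4)"))  # [['U1', 'U2'], ['U3', 'U4']]
```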
>
>
> Geo-rep can split by "," and then split by space and take the
> first UUID of each group. DHT can split the value by space or
> comma and get the list of unique UUIDs.
>
>
> This doesn't solve the problem I described in the previous email.
> Some more logic will need to be added to prevent more than one
> node in each replica-set from becoming active. If we have some explicit
> hierarchy information in the node-uuid value, more decisions can
> be taken.
>
> An initial proposal I made was this:
>
> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))
>
> This is harder to parse, but gives a lot of information: DHT with
> 2 subvolumes, each subvolume is an AFR with replica 2 and no
> arbiters. It's also easily extensible with any new xlator that
> changes the layout.
>
> However maybe this is not the moment to do this, and probably we
> could implement this in a new xattr with a better name.
>
> Xavi
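The hierarchical descriptor is harder to parse than the flat list, but only slightly; here is a hypothetical recursive-descent sketch (the grammar and the dict layout are assumptions for illustration, not an agreed format):

```python
import re

# Hypothetical recursive-descent parser for the hierarchical
# node-uuid descriptor proposed above, e.g.
#   DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))
# The grammar and dict layout are assumptions for illustration only.

TOKEN = re.compile(r"[A-Za-z0-9_-]+|\[[^\]]*\]|[(),]")

def parse(desc):
    tokens = TOKEN.findall(desc.replace(" ", ""))
    node, _ = _parse(tokens)
    return node

def _parse(tokens):
    # A node is NAME, an optional [p1,p2,...] parameter list, and an
    # optional parenthesized list of child nodes.
    name = tokens.pop(0)
    params = []
    if tokens and tokens[0].startswith("["):
        params = [int(p) for p in tokens.pop(0)[1:-1].split(",")]
    children = []
    if tokens and tokens[0] == "(":
        tokens.pop(0)                      # consume "("
        while tokens[0] != ")":
            if tokens[0] == ",":
                tokens.pop(0)              # skip child separator
                continue
            child, tokens = _parse(tokens)
            children.append(child)
        tokens.pop(0)                      # consume ")"
    return {"xl": name, "params": params, "children": children}, tokens
```

With such a tree available, geo-rep could, for example, mark a worker Active only when its UUID is the first leaf of its replica-set, and DHT could pick one leaf per subvolume.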
>
>
>
> Another question is about the behavior when a node is down: the
> existing node-uuid xattr will not return that node's UUID. What is
> the behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>
> Hi Xavi,
>
> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>
> Hi Aravinda,
>
> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>
> Adding more people to get a consensus about this.
>
> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda
> <avishwan at redhat.com <mailto:avishwan at redhat.com>> wrote:
>
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
> Hi Pranith,
>
> adding gluster-devel, Kotresh and Aravinda,
>
> On 20/06/17 09:45, Pranith Kumar Karampuri
> wrote:
>
>
>
> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>
> On 20/06/17 09:31, Pranith Kumar
> Karampuri wrote:
>
> The way geo-replication works is:
> On each machine, it does getxattr of node-uuid and checks if its
> own uuid is present in the list. If it is present then it will
> consider itself active, otherwise it will be considered passive.
> With this change we are giving all uuids instead of the first-up
> subvolume, so all machines think they are ACTIVE, which is bad
> apparently. So that is the reason. Even I felt bad that we are
> doing this change.
>
>
> And what about changing the content of node-uuid to include some
> sort of hierarchy?
>
> for example:
>
> a single brick:
>
> NODE(<guid>)
>
> AFR/EC:
>
> AFR[2](NODE(<guid>), NODE(<guid>))
> EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>
> DHT:
>
> DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
> AFR[2](NODE(<guid>), NODE(<guid>)))
>
> This gives a lot of information that can be used to take the
> appropriate decisions.
>
>
> I guess that is not backward compatible. Shall I CC gluster-devel
> and Kotresh/Aravinda?
>
>
> Is the change we did backward compatible? If we only require the
> first field to be a GUID to support backward compatibility, we can
> use something like this:
>
> No. But the necessary change can be made to the Geo-rep code as
> well if the format is changed, since all these are built/shipped
> together.
>
> Geo-rep uses node-uuid as follows:
>
> value = getxattr(node-uuid)
> active_node_uuids = value.split(SPACE)
> active_node_flag = self.node_id in active_node_uuids
>
>
> How was this case solved?
>
> Suppose we have three servers and 2 bricks in each server. A
> replicated volume is created using the following command:
>
> gluster volume create test replica 2 server1:/brick1 server2:/brick1
> server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2
>
> In this case we have three replica-sets:
>
> * server1:/brick1 server2:/brick1
> * server2:/brick2 server3:/brick1
> * server3:/brick2 server1:/brick2
>
> Old AFR implementation for node-uuid always returned the uuid of
> the node of the first brick, so in this case we will get the uuids
> of all three nodes because each of them hosts the first brick of a
> replica-set.
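A small sketch of why this layout activates every node under the old first-brick behavior (brick placement assumed as intended above, two bricks per server; names are illustrative):

```python
# Sketch (illustrative only) of why the example layout makes every
# node Active under the old first-brick behavior. Brick placement is
# assumed as intended above: two bricks per server, three replica pairs.
replica_sets = [
    [("server1", "brick1"), ("server2", "brick1")],
    [("server2", "brick2"), ("server3", "brick1")],
    [("server3", "brick2"), ("server1", "brick2")],
]

# Old AFR node-uuid: each replica-set reports the uuid of the node
# hosting its first brick.
first_brick_nodes = [rs[0][0] for rs in replica_sets]
print(first_brick_nodes)  # every server leads some replica-set
```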
>
> Does this mean that with this configuration all nodes are active?
> Is this a problem? Is there any other check to avoid this situation
> if it's not good?
>
> Yes, all Geo-rep workers will become Active and participate in
> syncing. Since changelogs will have the same information in replica
> bricks, this will lead to duplicate syncing and consume network
> bandwidth.
>
> Node-uuid based Active worker selection has been the default
> configuration in Geo-rep till now. Geo-rep also has Meta Volume
> based synchronization for Active workers using lock files (this can
> be opted for via Geo-rep configuration; with that config, node-uuid
> will not be used).
>
> Kotresh proposed a solution to configure which workers become
> Active. This will give the Admin more control to choose Active
> workers. This will become the default configuration from 3.12:
> https://github.com/gluster/glusterfs/issues/244
>
> --
> Aravinda
>
>
> Xavi
>
>
>
>
> Bricks:
>
> <guid>
>
> AFR/EC:
> <guid>(<guid>, <guid>)
>
> DHT:
> <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>
> In this case, AFR and EC would return the same <guid> they
> returned before the patch, but between '(' and ')' they put the
> full list of guids of all nodes. The first <guid> can be used by
> geo-replication. The list after the first <guid> can be used for
> rebalance.
>
> Not sure if there's any user of node-uuid above DHT.
>
> Xavi
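A hypothetical parser for this backward-compatible form: the value starts with the same <guid> AFR/EC returned before the patch, optionally followed by the full member list in parentheses, e.g. "U1(U1, U2)". This sketch handles flat (non-nested) values only; the function name and the example values are illustrative:

```python
import re

# Hypothetical parser for the backward-compatible form described
# above: the value starts with the same <guid> AFR/EC returned before
# the patch, optionally followed by the full member list in
# parentheses, e.g. "U1(U1, U2)". Flat (non-nested) values only;
# names are illustrative.
def split_compat(value):
    m = re.match(r"\s*([^(\s]+)\s*(?:\((.*)\))?\s*$", value)
    head = m.group(1)                      # usable by geo-replication
    members = ([x.strip() for x in m.group(2).split(",")]
               if m.group(2) else [head])  # usable by rebalance
    return head, members

print(split_compat("U1(U1, U2)"))  # ('U1', ['U1', 'U2'])
print(split_compat("U1"))          # ('U1', ['U1'])
```

The nested DHT form "<guid>(<guid>(<guid>, ...), ...)" would need a recursive parser instead of a single regex.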
>
>
>
>
> Xavi
>
>
> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>>
> wrote:
>
> Hi Pranith,
>
> On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>
> hi Xavi,
> We all made the mistake of not sending a note about changing
> the behavior of the node-uuid xattr so that rebalance can use
> multiple nodes for doing rebalance. Because of this, on geo-rep
> all the workers are becoming active instead of one per EC/AFR
> subvolume. So we are frantically trying to restore the
> functionality of node-uuid and introduce a new xattr for the new
> behavior. Sunil will be sending out a patch for this.
>
>
> Wouldn't it be better to
> change geo-rep
> behavior
> to use the
> new data
> ? I think it's better as
> it's now, since it
> gives more
> information
> to upper layers so that
> they can take more
> accurate decisions.
>
> Xavi
>
>
> --
> Pranith