[Gluster-devel] geo-rep regression because of node-uuid change

Wed Jun 21 07:48:14 UTC 2017

On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> I'm ok with reverting node-uuid content to the previous format and create
> a new xattr for the new format. Currently, only rebalance will use it.
>
> Only thing to consider is what can happen if we have a half upgraded
> cluster where some clients have this change and some not. Can rebalance
> work in this situation ? if so, could there be any issue ?
>

I think there shouldn't be any problem, because this is in-memory xattr so
layers below afr/ec will only see node-uuid xattr.
This also gives us a chance to do whatever we want to do in future with
this xattr without any problems about backward compatibility.

You can check
https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
for how karthik implemented this in AFR (this got merged accidentally
yesterday, but looks like this is what we are settling on)

>
> Xavi
>
>
> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
>
>
>
>
> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbalacha at redhat.com
> > wrote:
>>
>>
>> On 20 June 2017 at 20:38, Aravinda <avishwan at redhat.com> wrote:
>>>
>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>
>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>> go with the format Aravinda suggested for now and in future we wanted some
>>> more changes for dht to detect which subvolume went down came back up, at
>>> that time we will revisit the solution suggested by Xavi.
>>>
>>> Susanth is doing the dht changes
>>> Aravinda is doing geo-rep changes
>>>
>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>>
>>>
>>
>> The proposed changes to the node-uuid behaviour (while good) are going to
>> break tiering . Tiering changes will take a little more time to be coded
>> and tested.
>>
>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>> go back to the original node-uuid behaviour for now so as to unblock the
>> release and target the proposed changes for the next 3.11 releases.
>>
>
> Let me see if I understand the changes correctly. We are restoring the
> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
> for both afr and ec, correct? Otherwise that is one more regression. If
> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
> patch yesterday which does these changes. If everyone is in agreement, we
> will leave it as is and add similar changes in ec as well. If we are not in
> agreement, then we will let the discussion progress :-)
>
>
>>
>>
>> Regards,
>> Nithya
>>
>>> --
>>> Aravinda
>>>
>>>
>>>
>>> Thanks to all of you guys for the discussions!
>>>
>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernandez at datalab.es
>>> > wrote:
>>>>
>>>> Hi Aravinda,
>>>>
>>>> On 20/06/17 12:42, Aravinda wrote:
>>>>>
>>>>> I think following format can be easily adopted by all components
>>>>>
>>>>> UUIDs of a subvolume are seperated by space and subvolumes are
>>>>> separated
>>>>> by comma
>>>>>
>>>>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>>>>> respectively and
>>>>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>>>>
>>>>> node-uuid can return "U1 U2,U3 U4"
>>>>
>>>>
>>>> While this is ok for current implementation, I think this can be
>>>> insufficient if there are more layers of xlators that require to indicate
>>>> some sort of grouping. Some representation that can represent hierarchy
>>>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>>>> as a separator).
>>>>
>>>>>
>>>>>
>>>>> Geo-rep can split by "," and then split by space and take first UUID
>>>>> DHT can split the value by space or comma and get unique UUIDs list
>>>>
>>>>
>>>> This doesn't solve the problem I described in the previous email. Some
>>>> more logic will need to be added to avoid more than one node from each
>>>> replica-set to be active. If we have some explicit hierarchy information in
>>>> the node-uuid value, more decisions can be taken.
>>>>
>>>> An initial proposal I made was this:
>>>>
>>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>>>>
>>>> This is harder to parse, but gives a lot of information: DHT with 2
>>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>>>> also easily extensible with any new xlator that changes the layout.
>>>>
>>>> However maybe this is not the moment to do this, and probably we could
>>>> implement this in a new xattr with a better name.
>>>>
>>>> Xavi
>>>>
>>>>>
>>>>>
>>>>> Another question is about the behavior when a node is down, existing
>>>>> node-uuid xattr will not return that UUID if a node is down. What is
>>>>> the
>>>>> behavior with the proposed xattr?
>>>>>
>>>>> Let me know your thoughts.
>>>>>
>>>>> regards
>>>>> Aravinda VK
>>>>>
>>>>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>>>>>
>>>>>> Hi Xavi,
>>>>>>
>>>>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>>>>>
>>>>>>> Hi Aravinda,
>>>>>>>
>>>>>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>>>>>>
>>>>>>>> Adding more people to get a consensus about this.
>>>>>>>>
>>>>>>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan at redhat.com
>>>>>>>> <mailto:avishwan at redhat.com>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>     regards
>>>>>>>>     Aravinda VK
>>>>>>>>
>>>>>>>>
>>>>>>>>     On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>>>>>>>
>>>>>>>>         Hi Pranith,
>>>>>>>>
>>>>>>>>         adding gluster-devel, Kotresh and Aravinda,
>>>>>>>>
>>>>>>>>         On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>             On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
>>>>>>>>             <xhernandez at datalab.es <mailto:xhernandez at datalab.es>
>>>>>>>>             <mailto:xhernandez at datalab.es
>>>>>>>>             <mailto:xhernandez at datalab.es>>> wrote:
>>>>>>>>
>>>>>>>>                 On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>>>>>>
>>>>>>>>                     The way geo-replication works is:
>>>>>>>>                     On each machine, it does getxattr of node-uuid
>>>>>>>> and
>>>>>>>>             check if its
>>>>>>>>                     own uuid
>>>>>>>>                     is present in the list. If it is present then it
>>>>>>>>             will consider
>>>>>>>>                     it active
>>>>>>>>                     otherwise it will be considered passive. With
>>>>>>>> this
>>>>>>>>             change we are
>>>>>>>>                     giving
>>>>>>>>                     all uuids instead of first-up subvolume. So all
>>>>>>>>             machines think
>>>>>>>>                     they are
>>>>>>>>                     ACTIVE which is bad apparently. So that is the
>>>>>>>>             reason. Even I
>>>>>>>>                     felt bad
>>>>>>>>                     that we are doing this change.
>>>>>>>>
>>>>>>>>
>>>>>>>>                 And what about changing the content of node-uuid to
>>>>>>>>             include some
>>>>>>>>                 sort of hierarchy ?
>>>>>>>>
>>>>>>>>                 for example:
>>>>>>>>
>>>>>>>>                 a single brick:
>>>>>>>>
>>>>>>>>                 NODE(<guid>)
>>>>>>>>
>>>>>>>>                 AFR/EC:
>>>>>>>>
>>>>>>>>                 AFR[2](NODE(<guid>), NODE(<guid>))
>>>>>>>>                 EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>>>>>>
>>>>>>>>                 DHT:
>>>>>>>>
>>>>>>>>                 DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>>>>>>>>             AFR[2](NODE(<guid>),
>>>>>>>>                 NODE(<guid>)))
>>>>>>>>
>>>>>>>>                 This gives a lot of information that can be used to
>>>>>>>> take the
>>>>>>>>                 appropriate decisions.
>>>>>>>>
>>>>>>>>
>>>>>>>>             I guess that is not backward compatible. Shall I CC
>>>>>>>>             gluster-devel and
>>>>>>>>             Kotresh/Aravinda?
>>>>>>>>
>>>>>>>>
>>>>>>>>         Is the change we did backward compatible ? if we only
>>>>>>>> require
>>>>>>>>         the first field to be a GUID to support backward
>>>>>>>> compatibility,
>>>>>>>>         we can use something like this:
>>>>>>>>
>>>>>>>>     No. But the necessary change can be made to Geo-rep code as
>>>>>>>> well if
>>>>>>>>     format is changed, Since all these are built/shipped together.
>>>>>>>>
>>>>>>>>     Geo-rep uses node-id as follows,
>>>>>>>>
>>>>>>>>     list = listxattr(node-uuid)
>>>>>>>>     active_node_uuids = list.split(SPACE)
>>>>>>>>     active_node_flag = True if self.node_id exists in
>>>>>>>> active_node_uuids
>>>>>>>>     else False
>>>>>>>
>>>>>>>
>>>>>>> How was this case solved ?
>>>>>>>
>>>>>>> suppose we have three servers and 2 bricks in each server. A
>>>>>>> replicated volume is created using the following command:
>>>>>>>
>>>>>>> gluster volume create test replica 2 server1:/brick1 server2:/brick1
>>>>>>> server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2
>>>>>>>
>>>>>>> In this case we have three replica-sets:
>>>>>>>
>>>>>>> * server1:/brick1 server2:/brick1
>>>>>>> * server2:/brick2 server3:/brick1
>>>>>>> * server3:/brick2 server2:/brick2
>>>>>>>
>>>>>>> Old AFR implementation for node-uuid always returned the uuid of the
>>>>>>> node of the first brick, so in this case we will get the uuid of the
>>>>>>> three nodes because all of them are the first brick of a replica-set.
>>>>>>>
>>>>>>> Does this mean that with this configuration all nodes are active ? Is
>>>>>>> this a problem ? Is there any other check to avoid this situation if
>>>>>>> it's not good ?
>>>>>>
>>>>>> Yes all Geo-rep workers will become Active and participate in syncing.
>>>>>> Since changelogs will have the same information in replica bricks this
>>>>>> will lead to duplicate syncing and consuming network bandwidth.
>>>>>>
>>>>>> Node-uuid based Active worker is the default configuration in Geo-rep
>>>>>> till now, Geo-rep also has Meta Volume based syncronization for Active
>>>>>> worker using lock files.(Can be opted using Geo-rep configuration,
>>>>>> with this config node-uuid will not be used)
>>>>>>
>>>>>> Kotresh proposed a solution to configure which worker to become
>>>>>> Active. This will give more control to Admin to choose Active workers,
>>>>>> This will become default configuration from 3.12
>>>>>> https://github.com/gluster/glusterfs/issues/244
>>>>>>
>>>>>> --
>>>>>> Aravinda
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>         Bricks:
>>>>>>>>
>>>>>>>>         <guid>
>>>>>>>>
>>>>>>>>         AFR/EC:
>>>>>>>>         <guid>(<guid>, <guid>)
>>>>>>>>
>>>>>>>>         DHT:
>>>>>>>>         <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>>>>>>>
>>>>>>>>         In this case, AFR and EC would return the same <guid> they
>>>>>>>>         returned before the patch, but between '(' and ')' they put
>>>>>>>> the
>>>>>>>>         full list of guid's of all nodes. The first <guid> can be
>>>>>>>> used
>>>>>>>>         by geo-replication. The list after the first <guid> can be
>>>>>>>> used
>>>>>>>>         for rebalance.
>>>>>>>>
>>>>>>>>         Not sure if there's any user of node-uuid above DHT.
>>>>>>>>
>>>>>>>>         Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                 Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>>                     On Tue, Jun 20, 2017 at 12:46 PM, Xavier
>>>>>>>> Hernandez
>>>>>>>>                     <xhernandez at datalab.es
>>>>>>>>             <mailto:xhernandez at datalab.es>
>>>>>>>> <mailto:xhernandez at datalab.es
>>>>>>>>             <mailto:xhernandez at datalab.es>>
>>>>>>>>                     <mailto:xhernandez at datalab.es
>>>>>>>>             <mailto:xhernandez at datalab.es>
>>>>>>>> <mailto:xhernandez at datalab.es
>>>>>>>>             <mailto:xhernandez at datalab.es>>>>
>>>>>>>>                     wrote:
>>>>>>>>
>>>>>>>>                         Hi Pranith,
>>>>>>>>
>>>>>>>>                         On 20/06/17 07:53, Pranith Kumar Karampuri
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>                             hi Xavi,
>>>>>>>>                                    We all made the mistake of not
>>>>>>>>             sending about changing
>>>>>>>>                             behavior of
>>>>>>>>                             node-uuid xattr so that rebalance can
>>>>>>>> use
>>>>>>>>             multiple nodes
>>>>>>>>                     for doing
>>>>>>>>                             rebalance. Because of this on geo-rep
>>>>>>>> all
>>>>>>>>             the workers
>>>>>>>>                     are becoming
>>>>>>>>                             active instead of one per EC/AFR
>>>>>>>> subvolume.
>>>>>>>>             So we are
>>>>>>>>                             frantically trying
>>>>>>>>                             to restore the functionality of
>>>>>>>> node-uuid
>>>>>>>>             and introduce
>>>>>>>>                     a new
>>>>>>>>                             xattr for
>>>>>>>>                             the new behavior. Sunil will be sending
>>>>>>>> out
>>>>>>>>             a patch for
>>>>>>>>                     this.
>>>>>>>>
>>>>>>>>
>>>>>>>>                         Wouldn't it be better to change geo-rep
>>>>>>>> behavior
>>>>>>>>             to use the
>>>>>>>>                     new data
>>>>>>>>                         ? I think it's better as it's now, since it
>>>>>>>>             gives more
>>>>>>>>                     information
>>>>>>>>                         to upper layers so that they can take more
>>>>>>>>             accurate decisions.
>>>>>>>>
>>>>>>>>                         Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>>                             --
>>>>>>>>                             Pranith
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                     --
>>>>>>>>                     Pranith
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>             --
>>>>>>>>             Pranith
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Pranith
>>>>>>>
>>>>>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>
>
> --
> Pranith
>
>
>
>

-- 
Pranith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20170621/de8492e2/attachment-0001.html>