[Gluster-devel] geo-rep regression because of node-uuid change

Karthik Subrahmanya ksubrahm at redhat.com
Wed Jun 21 08:38:06 UTC 2017


On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> That's ok. I'm currently unable to write a patch for this on ec.

Sunil is working on this patch.

~Karthik

> If no one can do it, I can try to do it in 6 - 7 hours...
>
> Xavi
>
>
> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
>
>
>
>
> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>>
>> I'm ok with reverting node-uuid content to the previous format and create
>> a new xattr for the new format. Currently, only rebalance will use it.
>>
>> Only thing to consider is what can happen if we have a half upgraded
>> cluster where some clients have this change and some not. Can rebalance
>> work in this situation ? if so, could there be any issue ?
>
>
> I think there shouldn't be any problem, because this is in-memory xattr so
> layers below afr/ec will only see node-uuid xattr.
> This also gives us a chance to do whatever we want to do in future with
> this xattr without any problems about backward compatibility.
>
> You can check https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/
> src/afr-inode-read.c at 1507 for how karthik implemented this in AFR (this
> got merged accidentally yesterday, but looks like this is what we are
> settling on)
>
>
>>
>> Xavi
>>
>>
>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <
>> pkarampu at redhat.com> wrote:
>>
>>
>>
>>
>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <
>> nbalacha at redhat.com> wrote:
>>>
>>>
>>> On 20 June 2017 at 20:38, Aravinda <avishwan at redhat.com> wrote:
>>>>
>>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>>
>>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>>> go with the format Aravinda suggested for now and in future we wanted some
>>>> more changes for dht to detect which subvolume went down came back up, at
>>>> that time we will revisit the solution suggested by Xavi.
>>>>
>>>> Susanth is doing the dht changes
>>>> Aravinda is doing geo-rep changes
>>>>
>>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>>>
>>>>
>>>
>>> The proposed changes to the node-uuid behaviour (while good) are going
>>> to break tiering . Tiering changes will take a little more time to be coded
>>> and tested.
>>>
>>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>>> go back to the original node-uuid behaviour for now so as to unblock the
>>> release and target the proposed changes for the next 3.11 releases.
>>>
>>
>> Let me see if I understand the changes correctly. We are restoring the
>> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
>> for both afr and ec, correct? Otherwise that is one more regression. If
>> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
>> patch yesterday which does these changes. If everyone is in agreement, we
>> will leave it as is and add similar changes in ec as well. If we are not in
>> agreement, then we will let the discussion progress :-)
>>
>>
>>>
>>>
>>> Regards,
>>> Nithya
>>>
>>>> --
>>>> Aravinda
>>>>
>>>>
>>>>
>>>> Thanks to all of you guys for the discussions!
>>>>
>>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <
>>>> xhernandez at datalab.es> wrote:
>>>>>
>>>>> Hi Aravinda,
>>>>>
>>>>> On 20/06/17 12:42, Aravinda wrote:
>>>>>>
>>>>>> I think following format can be easily adopted by all components
>>>>>>
>>>>>> UUIDs of a subvolume are seperated by space and subvolumes are
>>>>>> separated
>>>>>> by comma
>>>>>>
>>>>>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>>>>>> respectively and
>>>>>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>>>>>
>>>>>> node-uuid can return "U1 U2,U3 U4"
>>>>>
>>>>>
>>>>> While this is ok for current implementation, I think this can be
>>>>> insufficient if there are more layers of xlators that require to indicate
>>>>> some sort of grouping. Some representation that can represent hierarchy
>>>>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>>>>> as a separator).
>>>>>
>>>>>>
>>>>>>
>>>>>> Geo-rep can split by "," and then split by space and take first UUID
>>>>>> DHT can split the value by space or comma and get unique UUIDs list
>>>>>
>>>>>
>>>>> This doesn't solve the problem I described in the previous email. Some
>>>>> more logic will need to be added to avoid more than one node from each
>>>>> replica-set to be active. If we have some explicit hierarchy information in
>>>>> the node-uuid value, more decisions can be taken.
>>>>>
>>>>> An initial proposal I made was this:
>>>>>
>>>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>>>>>
>>>>> This is harder to parse, but gives a lot of information: DHT with 2
>>>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>>>>> also easily extensible with any new xlator that changes the layout.
>>>>>
>>>>> However maybe this is not the moment to do this, and probably we could
>>>>> implement this in a new xattr with a better name.
>>>>>
>>>>> Xavi
>>>>>
>>>>>>
>>>>>>
>>>>>> Another question is about the behavior when a node is down, existing
>>>>>> node-uuid xattr will not return that UUID if a node is down. What is
>>>>>> the
>>>>>> behavior with the proposed xattr?
>>>>>>
>>>>>> Let me know your thoughts.
>>>>>>
>>>>>> regards
>>>>>> Aravinda VK
>>>>>>
>>>>>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>>>>>>
>>>>>>> Hi Xavi,
>>>>>>>
>>>>>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>>>>>>
>>>>>>>> Hi Aravinda,
>>>>>>>>
>>>>>>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>>>>>>>
>>>>>>>>> Adding more people to get a consensus about this.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan at redhat.com
>>>>>>>>> <mailto:avishwan at redhat.com>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     regards
>>>>>>>>>     Aravinda VK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>>>>>>>>
>>>>>>>>>         Hi Pranith,
>>>>>>>>>
>>>>>>>>>         adding gluster-devel, Kotresh and Aravinda,
>>>>>>>>>
>>>>>>>>>         On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>             On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
>>>>>>>>>             <xhernandez at datalab.es <mailto:xhernandez at datalab.es>
>>>>>>>>>             <mailto:xhernandez at datalab.es
>>>>>>>>>             <mailto:xhernandez at datalab.es>>> wrote:
>>>>>>>>>
>>>>>>>>>                 On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>>>>>>>
>>>>>>>>>                     The way geo-replication works is:
>>>>>>>>>                     On each machine, it does getxattr of node-uuid
>>>>>>>>> and
>>>>>>>>>             check if its
>>>>>>>>>                     own uuid
>>>>>>>>>                     is present in the list. If it is present then
>>>>>>>>> it
>>>>>>>>>             will consider
>>>>>>>>>                     it active
>>>>>>>>>                     otherwise it will be considered passive. With
>>>>>>>>> this
>>>>>>>>>             change we are
>>>>>>>>>                     giving
>>>>>>>>>                     all uuids instead of first-up subvolume. So all
>>>>>>>>>             machines think
>>>>>>>>>                     they are
>>>>>>>>>                     ACTIVE which is bad apparently. So that is the
>>>>>>>>>             reason. Even I
>>>>>>>>>                     felt bad
>>>>>>>>>                     that we are doing this change.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                 And what about changing the content of node-uuid to
>>>>>>>>>             include some
>>>>>>>>>                 sort of hierarchy ?
>>>>>>>>>
>>>>>>>>>                 for example:
>>>>>>>>>
>>>>>>>>>                 a single brick:
>>>>>>>>>
>>>>>>>>>                 NODE(<guid>)
>>>>>>>>>
>>>>>>>>>                 AFR/EC:
>>>>>>>>>
>>>>>>>>>                 AFR[2](NODE(<guid>), NODE(<guid>))
>>>>>>>>>                 EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>>>>>>>
>>>>>>>>>                 DHT:
>>>>>>>>>
>>>>>>>>>                 DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>>>>>>>>>             AFR[2](NODE(<guid>),
>>>>>>>>>                 NODE(<guid>)))
>>>>>>>>>
>>>>>>>>>                 This gives a lot of information that can be used to
>>>>>>>>> take the
>>>>>>>>>                 appropriate decisions.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>             I guess that is not backward compatible. Shall I CC
>>>>>>>>>             gluster-devel and
>>>>>>>>>             Kotresh/Aravinda?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         Is the change we did backward compatible ? if we only
>>>>>>>>> require
>>>>>>>>>         the first field to be a GUID to support backward
>>>>>>>>> compatibility,
>>>>>>>>>         we can use something like this:
>>>>>>>>>
>>>>>>>>>     No. But the necessary change can be made to Geo-rep code as
>>>>>>>>> well if
>>>>>>>>>     format is changed, Since all these are built/shipped together.
>>>>>>>>>
>>>>>>>>>     Geo-rep uses node-id as follows,
>>>>>>>>>
>>>>>>>>>     list = listxattr(node-uuid)
>>>>>>>>>     active_node_uuids = list.split(SPACE)
>>>>>>>>>     active_node_flag = True if self.node_id exists in
>>>>>>>>> active_node_uuids
>>>>>>>>>     else False
>>>>>>>>
>>>>>>>>
>>>>>>>> How was this case solved ?
>>>>>>>>
>>>>>>>> suppose we have three servers and 2 bricks in each server. A
>>>>>>>> replicated volume is created using the following command:
>>>>>>>>
>>>>>>>> gluster volume create test replica 2 server1:/brick1 server2:/brick1
>>>>>>>> server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2
>>>>>>>>
>>>>>>>> In this case we have three replica-sets:
>>>>>>>>
>>>>>>>> * server1:/brick1 server2:/brick1
>>>>>>>> * server2:/brick2 server3:/brick1
>>>>>>>> * server3:/brick2 server2:/brick2
>>>>>>>>
>>>>>>>> Old AFR implementation for node-uuid always returned the uuid of the
>>>>>>>> node of the first brick, so in this case we will get the uuid of the
>>>>>>>> three nodes because all of them are the first brick of a
>>>>>>>> replica-set.
>>>>>>>>
>>>>>>>> Does this mean that with this configuration all nodes are active ?
>>>>>>>> Is
>>>>>>>> this a problem ? Is there any other check to avoid this situation if
>>>>>>>> it's not good ?
>>>>>>>
>>>>>>> Yes all Geo-rep workers will become Active and participate in
>>>>>>> syncing.
>>>>>>> Since changelogs will have the same information in replica bricks
>>>>>>> this
>>>>>>> will lead to duplicate syncing and consuming network bandwidth.
>>>>>>>
>>>>>>> Node-uuid based Active worker is the default configuration in Geo-rep
>>>>>>> till now, Geo-rep also has Meta Volume based syncronization for
>>>>>>> Active
>>>>>>> worker using lock files.(Can be opted using Geo-rep configuration,
>>>>>>> with this config node-uuid will not be used)
>>>>>>>
>>>>>>> Kotresh proposed a solution to configure which worker to become
>>>>>>> Active. This will give more control to Admin to choose Active
>>>>>>> workers,
>>>>>>> This will become default configuration from 3.12
>>>>>>> https://github.com/gluster/glusterfs/issues/244
>>>>>>>
>>>>>>> --
>>>>>>> Aravinda
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         Bricks:
>>>>>>>>>
>>>>>>>>>         <guid>
>>>>>>>>>
>>>>>>>>>         AFR/EC:
>>>>>>>>>         <guid>(<guid>, <guid>)
>>>>>>>>>
>>>>>>>>>         DHT:
>>>>>>>>>         <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>>>>>>>>
>>>>>>>>>         In this case, AFR and EC would return the same <guid> they
>>>>>>>>>         returned before the patch, but between '(' and ')' they
>>>>>>>>> put the
>>>>>>>>>         full list of guid's of all nodes. The first <guid> can be
>>>>>>>>> used
>>>>>>>>>         by geo-replication. The list after the first <guid> can be
>>>>>>>>> used
>>>>>>>>>         for rebalance.
>>>>>>>>>
>>>>>>>>>         Not sure if there's any user of node-uuid above DHT.
>>>>>>>>>
>>>>>>>>>         Xavi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                 Xavi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                     On Tue, Jun 20, 2017 at 12:46 PM, Xavier
>>>>>>>>> Hernandez
>>>>>>>>>                     <xhernandez at datalab.es
>>>>>>>>>             <mailto:xhernandez at datalab.es>
>>>>>>>>> <mailto:xhernandez at datalab.es
>>>>>>>>>             <mailto:xhernandez at datalab.es>>
>>>>>>>>>                     <mailto:xhernandez at datalab.es
>>>>>>>>>             <mailto:xhernandez at datalab.es>
>>>>>>>>> <mailto:xhernandez at datalab.es
>>>>>>>>>             <mailto:xhernandez at datalab.es>>>>
>>>>>>>>>                     wrote:
>>>>>>>>>
>>>>>>>>>                         Hi Pranith,
>>>>>>>>>
>>>>>>>>>                         On 20/06/17 07:53, Pranith Kumar Karampuri
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>                             hi Xavi,
>>>>>>>>>                                    We all made the mistake of not
>>>>>>>>>             sending about changing
>>>>>>>>>                             behavior of
>>>>>>>>>                             node-uuid xattr so that rebalance can
>>>>>>>>> use
>>>>>>>>>             multiple nodes
>>>>>>>>>                     for doing
>>>>>>>>>                             rebalance. Because of this on geo-rep
>>>>>>>>> all
>>>>>>>>>             the workers
>>>>>>>>>                     are becoming
>>>>>>>>>                             active instead of one per EC/AFR
>>>>>>>>> subvolume.
>>>>>>>>>             So we are
>>>>>>>>>                             frantically trying
>>>>>>>>>                             to restore the functionality of
>>>>>>>>> node-uuid
>>>>>>>>>             and introduce
>>>>>>>>>                     a new
>>>>>>>>>                             xattr for
>>>>>>>>>                             the new behavior. Sunil will be
>>>>>>>>> sending out
>>>>>>>>>             a patch for
>>>>>>>>>                     this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                         Wouldn't it be better to change geo-rep
>>>>>>>>> behavior
>>>>>>>>>             to use the
>>>>>>>>>                     new data
>>>>>>>>>                         ? I think it's better as it's now, since it
>>>>>>>>>             gives more
>>>>>>>>>                     information
>>>>>>>>>                         to upper layers so that they can take more
>>>>>>>>>             accurate decisions.
>>>>>>>>>
>>>>>>>>>                         Xavi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                             --
>>>>>>>>>                             Pranith
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                     --
>>>>>>>>>                     Pranith
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>             --
>>>>>>>>>             Pranith
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Pranith
>>>>>>>>
>>>>>>>>
>>>>
>>>>
>>>> --
>>>> Pranith
>>>>
>>>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>
>
>
> --
> Pranith
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20170621/4783a4e9/attachment-0001.html>


More information about the Gluster-devel mailing list