[Gluster-devel] geo-rep regression because of node-uuid change

Mon Jul 3 06:33:48 UTC 2017

Xavi,
      Now that the change has been reverted, we can resume this discussion
and decide on the exact format that considers, tier, dht, afr, ec. People
working geo-rep/dht/afr/ec had an internal discussion and we all agreed
that this proposal would be a good way forward. I think once we agree on
the format and decide on the initial encoding/decoding functions of the
xattr and this change is merged, we can send patches on afr/ec/dht and
geo-rep to take it to closure.

Could you propose the new format you have in mind that considers all of the
xlators?

On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya <ksubrahm at redhat.com>
wrote:

>
>
> On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>
>> That's ok. I'm currently unable to write a patch for this on ec.
>
> Sunil is working on this patch.
>
> ~Karthik
>
>> If no one can do it, I can try to do it in 6 - 7 hours...
>>
>> Xavi
>>
>>
>> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <
>> pkarampu at redhat.com> wrote:
>>
>>
>>
>>
>> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez at datalab.es>
>> wrote:
>>>
>>> I'm ok with reverting node-uuid content to the previous format and
>>> create a new xattr for the new format. Currently, only rebalance will use
>>> it.
>>>
>>> Only thing to consider is what can happen if we have a half upgraded
>>> cluster where some clients have this change and some not. Can rebalance
>>> work in this situation ? if so, could there be any issue ?
>>
>>
>> I think there shouldn't be any problem, because this is in-memory xattr
>> so layers below afr/ec will only see node-uuid xattr.
>> This also gives us a chance to do whatever we want to do in future with
>> this xattr without any problems about backward compatibility.
>>
>> You can check https://review.gluster.org/#/c
>> /17576/3/xlators/cluster/afr/src/afr-inode-read.c at 1507 for how karthik
>> implemented this in AFR (this got merged accidentally yesterday, but looks
>> like this is what we are settling on)
>>
>>
>>>
>>> Xavi
>>>
>>>
>>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <
>>> pkarampu at redhat.com> wrote:
>>>
>>>
>>>
>>>
>>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <
>>> nbalacha at redhat.com> wrote:
>>>>
>>>>
>>>> On 20 June 2017 at 20:38, Aravinda <avishwan at redhat.com> wrote:
>>>>>
>>>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>>>
>>>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>>>> go with the format Aravinda suggested for now and in future we wanted some
>>>>> more changes for dht to detect which subvolume went down came back up, at
>>>>> that time we will revisit the solution suggested by Xavi.
>>>>>
>>>>> Susanth is doing the dht changes
>>>>> Aravinda is doing geo-rep changes
>>>>>
>>>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>>>>
>>>>>
>>>>
>>>> The proposed changes to the node-uuid behaviour (while good) are going
>>>> to break tiering . Tiering changes will take a little more time to be coded
>>>> and tested.
>>>>
>>>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>>>> go back to the original node-uuid behaviour for now so as to unblock the
>>>> release and target the proposed changes for the next 3.11 releases.
>>>>
>>>
>>> Let me see if I understand the changes correctly. We are restoring the
>>> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
>>> for both afr and ec, correct? Otherwise that is one more regression. If
>>> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
>>> patch yesterday which does these changes. If everyone is in agreement, we
>>> will leave it as is and add similar changes in ec as well. If we are not in
>>> agreement, then we will let the discussion progress :-)
>>>
>>>
>>>>
>>>>
>>>> Regards,
>>>> Nithya
>>>>
>>>>> --
>>>>> Aravinda
>>>>>
>>>>>
>>>>>
>>>>> Thanks to all of you guys for the discussions!
>>>>>
>>>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <
>>>>> xhernandez at datalab.es> wrote:
>>>>>>
>>>>>> Hi Aravinda,
>>>>>>
>>>>>> On 20/06/17 12:42, Aravinda wrote:
>>>>>>>
>>>>>>> I think following format can be easily adopted by all components
>>>>>>>
>>>>>>> UUIDs of a subvolume are seperated by space and subvolumes are
>>>>>>> separated
>>>>>>> by comma
>>>>>>>
>>>>>>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>>>>>>> respectively and
>>>>>>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>>>>>>
>>>>>>> node-uuid can return "U1 U2,U3 U4"
>>>>>>
>>>>>>
>>>>>> While this is ok for current implementation, I think this can be
>>>>>> insufficient if there are more layers of xlators that require to indicate
>>>>>> some sort of grouping. Some representation that can represent hierarchy
>>>>>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>>>>>> as a separator).
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Geo-rep can split by "," and then split by space and take first UUID
>>>>>>> DHT can split the value by space or comma and get unique UUIDs list
>>>>>>
>>>>>>
>>>>>> This doesn't solve the problem I described in the previous email.
>>>>>> Some more logic will need to be added to avoid more than one node from each
>>>>>> replica-set to be active. If we have some explicit hierarchy information in
>>>>>> the node-uuid value, more decisions can be taken.
>>>>>>
>>>>>> An initial proposal I made was this:
>>>>>>
>>>>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>>>>>>
>>>>>> This is harder to parse, but gives a lot of information: DHT with 2
>>>>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>>>>>> also easily extensible with any new xlator that changes the layout.
>>>>>>
>>>>>> However maybe this is not the moment to do this, and probably we
>>>>>> could implement this in a new xattr with a better name.
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Another question is about the behavior when a node is down, existing
>>>>>>> node-uuid xattr will not return that UUID if a node is down. What is
>>>>>>> the
>>>>>>> behavior with the proposed xattr?
>>>>>>>
>>>>>>> Let me know your thoughts.
>>>>>>>
>>>>>>> regards
>>>>>>> Aravinda VK
>>>>>>>
>>>>>>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>>>>>>>
>>>>>>>> Hi Xavi,
>>>>>>>>
>>>>>>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>>>>>>>
>>>>>>>>> Hi Aravinda,
>>>>>>>>>
>>>>>>>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>>>>>>>>
>>>>>>>>>> Adding more people to get a consensus about this.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan at redhat.com
>>>>>>>>>> <mailto:avishwan at redhat.com>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     regards
>>>>>>>>>>     Aravinda VK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>>>>>>>>>
>>>>>>>>>>         Hi Pranith,
>>>>>>>>>>
>>>>>>>>>>         adding gluster-devel, Kotresh and Aravinda,
>>>>>>>>>>
>>>>>>>>>>         On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>             On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
>>>>>>>>>>             <xhernandez at datalab.es <mailto:xhernandez at datalab.es>
>>>>>>>>>>             <mailto:xhernandez at datalab.es
>>>>>>>>>>             <mailto:xhernandez at datalab.es>>> wrote:
>>>>>>>>>>
>>>>>>>>>>                 On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>>>>>>>>
>>>>>>>>>>                     The way geo-replication works is:
>>>>>>>>>>                     On each machine, it does getxattr of
>>>>>>>>>> node-uuid and
>>>>>>>>>>             check if its
>>>>>>>>>>                     own uuid
>>>>>>>>>>                     is present in the list. If it is present then
>>>>>>>>>> it
>>>>>>>>>>             will consider
>>>>>>>>>>                     it active
>>>>>>>>>>                     otherwise it will be considered passive. With
>>>>>>>>>> this
>>>>>>>>>>             change we are
>>>>>>>>>>                     giving
>>>>>>>>>>                     all uuids instead of first-up subvolume. So
>>>>>>>>>> all
>>>>>>>>>>             machines think
>>>>>>>>>>                     they are
>>>>>>>>>>                     ACTIVE which is bad apparently. So that is the
>>>>>>>>>>             reason. Even I
>>>>>>>>>>                     felt bad
>>>>>>>>>>                     that we are doing this change.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                 And what about changing the content of node-uuid
>>>>>>>>>> to
>>>>>>>>>>             include some
>>>>>>>>>>                 sort of hierarchy ?
>>>>>>>>>>
>>>>>>>>>>                 for example:
>>>>>>>>>>
>>>>>>>>>>                 a single brick:
>>>>>>>>>>
>>>>>>>>>>                 NODE(<guid>)
>>>>>>>>>>
>>>>>>>>>>                 AFR/EC:
>>>>>>>>>>
>>>>>>>>>>                 AFR[2](NODE(<guid>), NODE(<guid>))
>>>>>>>>>>                 EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>>>>>>>>
>>>>>>>>>>                 DHT:
>>>>>>>>>>
>>>>>>>>>>                 DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>>>>>>>>>>             AFR[2](NODE(<guid>),
>>>>>>>>>>                 NODE(<guid>)))
>>>>>>>>>>
>>>>>>>>>>                 This gives a lot of information that can be used
>>>>>>>>>> to
>>>>>>>>>> take the
>>>>>>>>>>                 appropriate decisions.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>             I guess that is not backward compatible. Shall I CC
>>>>>>>>>>             gluster-devel and
>>>>>>>>>>             Kotresh/Aravinda?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Is the change we did backward compatible ? if we only
>>>>>>>>>> require
>>>>>>>>>>         the first field to be a GUID to support backward
>>>>>>>>>> compatibility,
>>>>>>>>>>         we can use something like this:
>>>>>>>>>>
>>>>>>>>>>     No. But the necessary change can be made to Geo-rep code as
>>>>>>>>>> well if
>>>>>>>>>>     format is changed, Since all these are built/shipped together.
>>>>>>>>>>
>>>>>>>>>>     Geo-rep uses node-id as follows,
>>>>>>>>>>
>>>>>>>>>>     list = listxattr(node-uuid)
>>>>>>>>>>     active_node_uuids = list.split(SPACE)
>>>>>>>>>>     active_node_flag = True if self.node_id exists in
>>>>>>>>>> active_node_uuids
>>>>>>>>>>     else False
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> How was this case solved ?
>>>>>>>>>
>>>>>>>>> suppose we have three servers and 2 bricks in each server. A
>>>>>>>>> replicated volume is created using the following command:
>>>>>>>>>
>>>>>>>>> gluster volume create test replica 2 server1:/brick1
>>>>>>>>> server2:/brick1
>>>>>>>>> server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2
>>>>>>>>>
>>>>>>>>> In this case we have three replica-sets:
>>>>>>>>>
>>>>>>>>> * server1:/brick1 server2:/brick1
>>>>>>>>> * server2:/brick2 server3:/brick1
>>>>>>>>> * server3:/brick2 server2:/brick2
>>>>>>>>>
>>>>>>>>> Old AFR implementation for node-uuid always returned the uuid of
>>>>>>>>> the
>>>>>>>>> node of the first brick, so in this case we will get the uuid of
>>>>>>>>> the
>>>>>>>>> three nodes because all of them are the first brick of a
>>>>>>>>> replica-set.
>>>>>>>>>
>>>>>>>>> Does this mean that with this configuration all nodes are active ?
>>>>>>>>> Is
>>>>>>>>> this a problem ? Is there any other check to avoid this situation
>>>>>>>>> if
>>>>>>>>> it's not good ?
>>>>>>>>
>>>>>>>> Yes all Geo-rep workers will become Active and participate in
>>>>>>>> syncing.
>>>>>>>> Since changelogs will have the same information in replica bricks
>>>>>>>> this
>>>>>>>> will lead to duplicate syncing and consuming network bandwidth.
>>>>>>>>
>>>>>>>> Node-uuid based Active worker is the default configuration in
>>>>>>>> Geo-rep
>>>>>>>> till now, Geo-rep also has Meta Volume based syncronization for
>>>>>>>> Active
>>>>>>>> worker using lock files.(Can be opted using Geo-rep configuration,
>>>>>>>> with this config node-uuid will not be used)
>>>>>>>>
>>>>>>>> Kotresh proposed a solution to configure which worker to become
>>>>>>>> Active. This will give more control to Admin to choose Active
>>>>>>>> workers,
>>>>>>>> This will become default configuration from 3.12
>>>>>>>> https://github.com/gluster/glusterfs/issues/244
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aravinda
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         Bricks:
>>>>>>>>>>
>>>>>>>>>>         <guid>
>>>>>>>>>>
>>>>>>>>>>         AFR/EC:
>>>>>>>>>>         <guid>(<guid>, <guid>)
>>>>>>>>>>
>>>>>>>>>>         DHT:
>>>>>>>>>>         <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>>>>>>>>>
>>>>>>>>>>         In this case, AFR and EC would return the same <guid> they
>>>>>>>>>>         returned before the patch, but between '(' and ')' they
>>>>>>>>>> put the
>>>>>>>>>>         full list of guid's of all nodes. The first <guid> can be
>>>>>>>>>> used
>>>>>>>>>>         by geo-replication. The list after the first <guid> can
>>>>>>>>>> be used
>>>>>>>>>>         for rebalance.
>>>>>>>>>>
>>>>>>>>>>         Not sure if there's any user of node-uuid above DHT.
>>>>>>>>>>
>>>>>>>>>>         Xavi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                 Xavi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     On Tue, Jun 20, 2017 at 12:46 PM, Xavier
>>>>>>>>>> Hernandez
>>>>>>>>>>                     <xhernandez at datalab.es
>>>>>>>>>>             <mailto:xhernandez at datalab.es>
>>>>>>>>>> <mailto:xhernandez at datalab.es
>>>>>>>>>>             <mailto:xhernandez at datalab.es>>
>>>>>>>>>>                     <mailto:xhernandez at datalab.es
>>>>>>>>>>             <mailto:xhernandez at datalab.es>
>>>>>>>>>> <mailto:xhernandez at datalab.es
>>>>>>>>>>             <mailto:xhernandez at datalab.es>>>>
>>>>>>>>>>                     wrote:
>>>>>>>>>>
>>>>>>>>>>                         Hi Pranith,
>>>>>>>>>>
>>>>>>>>>>                         On 20/06/17 07:53, Pranith Kumar Karampuri
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>                             hi Xavi,
>>>>>>>>>>                                    We all made the mistake of not
>>>>>>>>>>             sending about changing
>>>>>>>>>>                             behavior of
>>>>>>>>>>                             node-uuid xattr so that rebalance can
>>>>>>>>>> use
>>>>>>>>>>             multiple nodes
>>>>>>>>>>                     for doing
>>>>>>>>>>                             rebalance. Because of this on geo-rep
>>>>>>>>>> all
>>>>>>>>>>             the workers
>>>>>>>>>>                     are becoming
>>>>>>>>>>                             active instead of one per EC/AFR
>>>>>>>>>> subvolume.
>>>>>>>>>>             So we are
>>>>>>>>>>                             frantically trying
>>>>>>>>>>                             to restore the functionality of
>>>>>>>>>> node-uuid
>>>>>>>>>>             and introduce
>>>>>>>>>>                     a new
>>>>>>>>>>                             xattr for
>>>>>>>>>>                             the new behavior. Sunil will be
>>>>>>>>>> sending out
>>>>>>>>>>             a patch for
>>>>>>>>>>                     this.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                         Wouldn't it be better to change geo-rep
>>>>>>>>>> behavior
>>>>>>>>>>             to use the
>>>>>>>>>>                     new data
>>>>>>>>>>                         ? I think it's better as it's now, since
>>>>>>>>>> it
>>>>>>>>>>             gives more
>>>>>>>>>>                     information
>>>>>>>>>>                         to upper layers so that they can take more
>>>>>>>>>>             accurate decisions.
>>>>>>>>>>
>>>>>>>>>>                         Xavi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                             --
>>>>>>>>>>                             Pranith
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     --
>>>>>>>>>>                     Pranith
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>             --
>>>>>>>>>>             Pranith
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Pranith
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Pranith
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>
>

-- 
Pranith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20170703/bea086fa/attachment-0001.html>