[Gluster-devel] geo-rep regression because of node-uuid change
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Jul 5 10:28:30 UTC 2017
On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:
> Hi Pranith,
>
> On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>
>> Xavi,
>> Now that the change has been reverted, we can resume this
>> discussion and decide on the exact format that considers tier, dht,
>> afr, and ec. People working on geo-rep/dht/afr/ec had an internal
>> discussion and we all agreed that this proposal would be a good way
>> forward. I think once we agree on the format, decide on the initial
>> encoding/decoding functions of the xattr, and merge this change, we
>> can send patches on afr/ec/dht and geo-rep to take it to closure.
>>
>> Could you propose the new format you have in mind that considers all of
>> the xlators?
>>
>
> My idea was to create a new xattr not bound to any particular function but
> which could give enough information to be used in many places.
>
> Currently we have another attribute called glusterfs.pathinfo that returns
> hierarchical information about the location of a file. Maybe we can extend
> this to unify all these attributes into a single feature that could be used
> for multiple purposes.
>
> Since we have time to discuss it, I would like to design it with more
> information than we have already discussed.
>
> First of all, the amount of information that this attribute can contain is
> quite big if we expect to have volumes with thousands of bricks. Even in
> the simplest case of returning only a UUID, we can easily go beyond the
> limit of 64KB.
>
> Consider also, for example, what shard should return when pathinfo is
> requested for a file. Probably it should return a list of shards, each one
> with all its associated pathinfo. We are talking about big amounts of data
> here.
>
> I think this kind of information doesn't fit very well in an extended
> attribute. Another thing to consider is that most probably the requester of
> the data only needs a fragment of it, so we are generating big amounts of
> data only to be parsed and reduced later, discarding most of it.
>
> What do you think about using a very special virtual file to manage all
> this information? It could be easily read using normal read fops, so it
> could manage big amounts of data easily. Also, by accessing only some parts
> of the file we could go directly to what we want, avoiding reading all the
> remaining data.
>
> A very basic idea could be this:
>
> Each xlator would have a reserved area of the file. We can reserve up to
> 4GB per xlator (32 bits). The remaining 32 bits of the offset would
> indicate the xlator we want to access.
>
> At offset 0 we have generic information about the volume. One of the
> things that this information should include is a basic hierarchy of the
> whole volume and the offset for each xlator.
>
> After reading this, the user will seek to the desired offset and read the
> information related to the xlator it is interested in.
>
> All the information should be stored in an easily extensible format that
> will remain compatible even if new information is added in the future (for
> example by doing special mappings of the 32-bit offsets reserved for the
> xlator).
>
> For example, we can reserve the first megabyte of the xlator area to hold a
> mapping of attributes to their respective offsets.
>
> I think that using a binary format would simplify all this a lot.
>
> Do you think this is a path worth exploring or should I stop wasting time here?
>
I think this just became a very big feature :-). Shall we just live with it
the way it is now?
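Xavi's offset scheme above (upper 32 bits select the xlator, lower 32 bits address up to 4 GiB inside that xlator's area, offset 0 holding the generic volume information) could be sketched roughly as follows; the function names are purely illustrative, not part of any Gluster API:

```python
# Sketch of the proposed layout: a 64-bit file offset whose upper 32 bits
# select an xlator and whose lower 32 bits address up to 4 GiB of data
# inside that xlator's reserved area.

XLATOR_SHIFT = 32
XLATOR_AREA = 1 << XLATOR_SHIFT  # 4 GiB reserved per xlator

def encode_offset(xlator_index, local_offset):
    """Build the file offset addressing a given xlator's data."""
    assert 0 <= local_offset < XLATOR_AREA
    return (xlator_index << XLATOR_SHIFT) | local_offset

def decode_offset(offset):
    """Recover (xlator_index, local_offset) from a file offset."""
    return offset >> XLATOR_SHIFT, offset & (XLATOR_AREA - 1)

# Offset 0 falls in area 0: the generic volume information, which would
# include the volume hierarchy and each xlator's area index.
assert decode_offset(0) == (0, 0)
assert decode_offset(encode_offset(3, 1024)) == (3, 1024)
```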
>
> Xavi
>
>
>>
>>
>> On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
>> <ksubrahm at redhat.com> wrote:
>>
>>
>>
>> On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> That's ok. I'm currently unable to write a patch for this on ec.
>>
>> Sunil is working on this patch.
>>
>> ~Karthik
>>
>> If no one can do it, I can try to do it in 6 - 7 hours...
>>
>> Xavi
>>
>>
>> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
>> <pkarampu at redhat.com> wrote:
>>
>>
>>>
>>> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
>>> <xhernandez at datalab.es> wrote:
>>>
>>> I'm ok with reverting node-uuid content to the previous
>>> format and create a new xattr for the new format.
>>> Currently, only rebalance will use it.
>>>
>>> Only thing to consider is what can happen if we have a
>>> half upgraded cluster where some clients have this change
>>> and some not. Can rebalance work in this situation ? if
>>> so, could there be any issue ?
>>>
>>>
>>> I think there shouldn't be any problem, because this is an
>>> in-memory xattr, so layers below afr/ec will only see the node-uuid
>>> xattr.
>>> This also gives us a chance to do whatever we want to do in
>>> future with this xattr without any problems about backward
>>> compatibility.
>>>
>>> You can check
>>> https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>>> for how Karthik implemented this in AFR (this got merged
>>> accidentally yesterday, but looks like this is what we are
>>> settling on).
>>>
>>>
>>>
>>> Xavi
>>>
>>>
>>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar
>>> Karampuri <pkarampu at redhat.com> wrote:
>>>
>>>
>>>>
>>>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran
>>>> <nbalacha at redhat.com> wrote:
>>>>
>>>>
>>>> On 20 June 2017 at 20:38, Aravinda
>>>> <avishwan at redhat.com> wrote:
>>>>
>>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri
>>>> wrote:
>>>>
>>>>> Xavi, Aravinda and I had a discussion on
>>>>> #gluster-dev and we agreed to go with the format
>>>>> Aravinda suggested for now. In the future we want
>>>>> some more changes for dht to detect which
>>>>> subvolume went down and came back up; at that
>>>>> time we will revisit the solution suggested by Xavi.
>>>>>
>>>>> Susanth is doing the dht changes
>>>>> Aravinda is doing geo-rep changes
>>>>>
>>>> Done. Geo-rep patch sent for review:
>>>> https://review.gluster.org/17582
>>>>
>>>>
>>>>
>>>> The proposed changes to the node-uuid behaviour
>>>> (while good) are going to break tiering. Tiering
>>>> changes will take a little more time to be coded and
>>>> tested.
>>>>
>>>> As this is a regression for 3.11 and a blocker for
>>>> 3.11.1, I suggest we go back to the original
>>>> node-uuid behaviour for now so as to unblock the
>>>> release and target the proposed changes for the next
>>>> 3.11 releases.
>>>>
>>>>
>>>> Let me see if I understand the changes correctly. We are
>>>> restoring the behavior of node-uuid xattr and adding a
>>>> new xattr for parallel rebalance for both afr and ec,
>>>> correct? Otherwise that is one more regression. If yes,
>>>> we will also wait for Xavi's inputs. Jeff accidentally
>>>> merged the afr patch yesterday which does these changes.
>>>> If everyone is in agreement, we will leave it as is and
>>>> add similar changes in ec as well. If we are not in
>>>> agreement, then we will let the discussion progress :-)
>>>>
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Nithya
>>>>
>>>> --
>>>> Aravinda
>>>>
>>>>
>>>>> Thanks to all of you guys for the discussions!
>>>>>
>>>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier
>>>>> Hernandez <xhernandez at datalab.es> wrote:
>>>>>
>>>>> Hi Aravinda,
>>>>>
>>>>> On 20/06/17 12:42, Aravinda wrote:
>>>>>
>>>>> I think following format can be easily
>>>>> adopted by all components
>>>>>
>>>>> UUIDs of a subvolume are separated by
>>>>> space and subvolumes are separated
>>>>> by comma
>>>>>
>>>>> For example, node1 and node2 are replica
>>>>> with U1 and U2 UUIDs
>>>>> respectively and
>>>>> node3 and node4 are replica with U3 and
>>>>> U4 UUIDs respectively
>>>>>
>>>>> node-uuid can return "U1 U2,U3 U4"
>>>>>
>>>>>
>>>>> While this is ok for the current implementation,
>>>>> I think this can be insufficient if there
>>>>> are more layers of xlators that need to
>>>>> indicate some sort of grouping. Some
>>>>> representation that can express hierarchy
>>>>> would be better. For example: "(U1 U2) (U3
>>>>> U4)" (we can use spaces or commas as
>>>>> separators).
>>>>>
>>>>>
>>>>>
>>>>> Geo-rep can split by "," and then split
>>>>> by space and take first UUID
>>>>> DHT can split the value by space or
>>>>> comma and get unique UUIDs list
>>>>>
>>>>>
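The flat format and the two parsing strategies quoted above could be sketched as follows (a minimal illustration using the example value from this thread):

```python
# Parse the proposed node-uuid value "U1 U2,U3 U4":
# subvolumes are separated by commas, UUIDs within a subvolume by spaces.

value = "U1 U2,U3 U4"

# Geo-rep: take the first UUID of each subvolume; only those nodes
# would run Active workers.
active_uuids = [subvol.split()[0] for subvol in value.split(",")]
assert active_uuids == ["U1", "U3"]

# DHT: collect the unique set of all UUIDs for parallel rebalance.
all_uuids = {u for subvol in value.split(",") for u in subvol.split()}
assert all_uuids == {"U1", "U2", "U3", "U4"}
```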
>>>>> This doesn't solve the problem I described
>>>>> in the previous email. Some more logic will
>>>>> need to be added to avoid more than one node
>>>>> from each replica-set to be active. If we
>>>>> have some explicit hierarchy information in
>>>>> the node-uuid value, more decisions can be
>>>>> taken.
>>>>>
>>>>> An initial proposal I made was this:
>>>>>
>>>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
>>>>> AFR[2,0](NODE(U1), NODE(U2)))
>>>>>
>>>>> This is harder to parse, but gives a lot of
>>>>> information: DHT with 2 subvolumes, each
>>>>> subvolume is an AFR with replica 2 and no
>>>>> arbiters. It's also easily extensible with
>>>>> any new xlator that changes the layout.
>>>>>
>>>>> However maybe this is not the moment to do
>>>>> this, and probably we could implement this
>>>>> in a new xattr with a better name.
>>>>>
>>>>> Xavi
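For illustration, the hierarchical expression above could be parsed with a small recursive-descent routine like the following (the grammar is inferred from the examples in this thread; this is a sketch, not code from any actual patch):

```python
# Tiny recursive-descent parser for the hierarchical node-uuid proposal,
# e.g. "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), ...)". Each node becomes a
# (name, children) tuple.
import re

TOKEN = re.compile(r"[A-Za-z0-9]+(?:\[[0-9,]+\])?|[(),]")

def parse(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        name = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume '('
            while tokens[pos] != ")":
                if tokens[pos] == ",":
                    pos += 1  # skip separator
                else:
                    children.append(node())
            pos += 1  # consume ')'
        return (name, children)

    return node()

tree = parse("DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))")
assert tree == ("DHT[2]", [
    ("AFR[2,0]", [("NODE", [("U1", [])]), ("NODE", [("U2", [])])]),
    ("AFR[2,0]", [("NODE", [("U3", [])]), ("NODE", [("U4", [])])]),
])
```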
>>>>>
>>>>>
>>>>>
>>>>> Another question is about the behavior
>>>>> when a node is down: the existing
>>>>> node-uuid xattr will not return that
>>>>> node's UUID. What is the
>>>>> behavior with the proposed xattr?
>>>>>
>>>>> Let me know your thoughts.
>>>>>
>>>>> regards
>>>>> Aravinda VK
>>>>>
>>>>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>>>>
>>>>> Hi Xavi,
>>>>>
>>>>> On 06/20/2017 02:51 PM, Xavier
>>>>> Hernandez wrote:
>>>>>
>>>>> Hi Aravinda,
>>>>>
>>>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>>>
>>>>> Adding more people to get a consensus about this.
>>>>>
>>>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda
>>>>> <avishwan at redhat.com> wrote:
>>>>>
>>>>> regards
>>>>> Aravinda VK
>>>>>
>>>>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>>>>
>>>>> Hi Pranith,
>>>>>
>>>>> adding gluster-devel, Kotresh and Aravinda,
>>>>>
>>>>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
>>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>>>
>>>>> The way geo-replication works is: on each machine, it does a
>>>>> getxattr of node-uuid and checks if its own uuid is present in
>>>>> the list. If it is present then it will consider itself active,
>>>>> otherwise it will be considered passive. With this change we are
>>>>> giving all uuids instead of the first-up subvolume, so all
>>>>> machines think they are ACTIVE, which is bad apparently. So that
>>>>> is the reason. Even I felt bad that we are doing this change.
>>>>>
>>>>>
>>>>> And what about changing the content of node-uuid to include some
>>>>> sort of hierarchy?
>>>>>
>>>>> For example:
>>>>>
>>>>> a single brick:
>>>>>
>>>>>     NODE(<guid>)
>>>>>
>>>>> AFR/EC:
>>>>>
>>>>>     AFR[2](NODE(<guid>), NODE(<guid>))
>>>>>     EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>>>
>>>>> DHT:
>>>>>
>>>>>     DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>>>>>            AFR[2](NODE(<guid>), NODE(<guid>)))
>>>>>
>>>>> This gives a lot of information that can be used to take the
>>>>> appropriate decisions.
>>>>>
>>>>> I guess that is not backward compatible. Shall I CC gluster-devel
>>>>> and Kotresh/Aravinda?
>>>>>
>>>>> Is the change we did backward compatible? If we only require the
>>>>> first field to be a GUID to support backward compatibility, we can
>>>>> use something like this:
>>>>>
>>>>> No. But the necessary change can be made to the Geo-rep code as
>>>>> well if the format is changed, since all of these are built and
>>>>> shipped together.
>>>>>
>>>>> Geo-rep uses node-uuid as follows:
>>>>>
>>>>>     list = listxattr(node-uuid)
>>>>>     active_node_uuids = list.split(SPACE)
>>>>>     active_node_flag = True if self.node_id exists in
>>>>>         active_node_uuids else False
>>>>>
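The pseudocode above amounts to the following check (illustrative only; `self.node_id` stands for the local node's UUID):

```python
# Each Geo-rep worker reads the node-uuid xattr and marks itself Active
# only if its own UUID appears in the returned space-separated list.
def is_active(node_uuid_value, own_uuid):
    return own_uuid in node_uuid_value.split(" ")

assert is_active("U1 U2 U3", "U2")
assert not is_active("U1 U2 U3", "U9")
```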
>>>>>
>>>>> How was this case solved?
>>>>>
>>>>> Suppose we have three servers and 2 bricks on each server. A
>>>>> replicated volume is created using the following command:
>>>>>
>>>>>     gluster volume create test replica 2 server1:/brick1
>>>>>     server2:/brick1 server2:/brick2 server3:/brick1
>>>>>     server3:/brick2 server1:/brick2
>>>>>
>>>>> In this case we have three replica-sets:
>>>>>
>>>>> * server1:/brick1 server2:/brick1
>>>>> * server2:/brick2 server3:/brick1
>>>>> * server3:/brick2 server1:/brick2
>>>>>
>>>>> The old AFR implementation of node-uuid always returned the uuid
>>>>> of the node of the first brick, so in this case we will get the
>>>>> uuids of all three nodes because each of them holds the first
>>>>> brick of a replica-set.
>>>>>
>>>>> Does this mean that with this configuration all nodes are active?
>>>>> Is this a problem? Is there any other check to avoid this
>>>>> situation if it's not good?
>>>>>
>>>>> Yes, all Geo-rep workers will become Active and participate in
>>>>> syncing. Since the changelogs will have the same information in
>>>>> the replica bricks, this will lead to duplicate syncing and
>>>>> consume network bandwidth.
>>>>>
>>>>> Node-uuid based Active worker selection has been the default
>>>>> configuration in Geo-rep till now. Geo-rep also has Meta Volume
>>>>> based synchronization for Active workers using lock files (it can
>>>>> be opted into via Geo-rep configuration; with this config
>>>>> node-uuid will not be used).
>>>>>
>>>>> Kotresh proposed a solution to configure which worker becomes
>>>>> Active. This gives the admin more control over choosing Active
>>>>> workers, and will become the default configuration from 3.12:
>>>>> https://github.com/gluster/glusterfs/issues/244
>>>>>
>>>>> --
>>>>> Aravinda
>>>>>
>>>>>
>>>>>
>>>>> Xavi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Bricks:
>>>>>
>>>>>     <guid>
>>>>>
>>>>> AFR/EC:
>>>>>
>>>>>     <guid>(<guid>, <guid>)
>>>>>
>>>>> DHT:
>>>>>
>>>>>     <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>>>>
>>>>> In this case, AFR and EC would return the same <guid> they
>>>>> returned before the patch, but between '(' and ')' they put the
>>>>> full list of guids of all nodes. The first <guid> can be used by
>>>>> geo-replication. The list after the first <guid> can be used for
>>>>> rebalance.
>>>>>
>>>>> Not sure if there's any user of node-uuid above DHT.
>>>>>
>>>>> Xavi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Xavi
>>>>>
>>>>> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
>>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>> Hi Pranith,
>>>>>
>>>>> On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>>>>>
>>>>> hi Xavi,
>>>>>
>>>>> We all made the mistake of not sending a mail about changing the
>>>>> behavior of the node-uuid xattr so that rebalance can use multiple
>>>>> nodes. Because of this, on geo-rep all the workers are becoming
>>>>> active instead of one per EC/AFR subvolume. So we are frantically
>>>>> trying to restore the functionality of node-uuid and introduce a
>>>>> new xattr for the new behavior. Sunil will be sending out a patch
>>>>> for this.
>>>>>
>>>>> Wouldn't it be better to change the geo-rep behavior to use the
>>>>> new data? I think it's better as it is now, since it gives more
>>>>> information to upper layers so that they can take more accurate
>>>>> decisions.
>>>>>
>>>>> Xavi
--
Pranith