[Gluster-devel] geo-rep regression because of node-uuid change

Fri Jul 7 14:41:56 UTC 2017

On Fri, Jul 7, 2017 at 3:05 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> On 07/07/17 11:25, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez <xhernandez at datalab.es>
>> wrote:
>>
>>     On 07/07/17 10:12, Pranith Kumar Karampuri wrote:
>>
>>
>>
>>         On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
>>         <xhernandez at datalab.es> wrote:
>>
>>             Hi Pranith,
>>
>>             On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
>>
>>
>>
>>                 On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
>>                 <xhernandez at datalab.es> wrote:
>>
>>                     Hi Pranith,
>>
>>                     On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>>
>>                         Xavi,
>>                               Now that the change has been reverted, we can
>>                         resume this discussion and decide on the exact format
>>                         that considers tier, dht, afr, and ec. People working
>>                         on geo-rep/dht/afr/ec had an internal discussion and
>>                         we all agreed that this proposal would be a good way
>>                         forward. I think once we agree on the format and
>>                         decide on the initial encoding/decoding functions of
>>                         the xattr and this change is merged, we can send
>>                         patches on afr/ec/dht and geo-rep to take it to
>>                         closure.
>>
>>                         Could you propose the new format you have in mind
>>                         that considers all of the xlators?
>>
>>
>>                     My idea was to create a new xattr not bound to any
>>                     particular function but which could give enough
>>                     information to be used in many places.
>>
>>                     Currently we have another attribute called
>>                     glusterfs.pathinfo that returns hierarchical information
>>                     about the location of a file. Maybe we can extend this to
>>                     unify all these attributes into a single feature that
>>                     could be used for multiple purposes.
>>
>>                     Since we have time to discuss it, I would like to design
>>                     it with more information than we have already talked
>>                     about.
>>
>>                     First of all, the amount of information that this
>>                     attribute can contain is quite big if we expect to have
>>                     volumes with thousands of bricks. Even in the simplest
>>                     case of returning only a UUID, we can easily go beyond
>>                     the limit of 64KB.
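
A rough back-of-the-envelope check of the 64KB concern above, as a sketch only. It assumes each UUID is sent in its canonical 36-character text form and that UUIDs are joined by a single separator byte; the names `xattr_size` and `max_bricks` are made up for this illustration and are not Gluster functions.

```python
UUID_TEXT_LEN = 36        # canonical textual UUID (8-4-4-4-12 hex digits plus dashes)
SEPARATOR_LEN = 1         # assumption: one separator byte between consecutive UUIDs
XATTR_LIMIT = 64 * 1024   # the 64KB limit mentioned in the thread


def xattr_size(bricks: int) -> int:
    """Bytes needed to store `bricks` UUIDs joined by single-byte separators."""
    if bricks <= 0:
        return 0
    return bricks * UUID_TEXT_LEN + (bricks - 1) * SEPARATOR_LEN


def max_bricks(limit: int = XATTR_LIMIT) -> int:
    """Largest number of bricks whose UUID list still fits in `limit` bytes."""
    # n * UUID + (n - 1) * SEP <= limit  =>  n <= (limit + SEP) / (UUID + SEP)
    return (limit + SEPARATOR_LEN) // (UUID_TEXT_LEN + SEPARATOR_LEN)


if __name__ == "__main__":
    print(max_bricks(), xattr_size(max_bricks()))
```

Under these assumptions the value stops fitting somewhere below two thousand bricks, before any per-xlator metadata is added.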
>>
>>                     Consider also, for example, what shard should return when
>>                     pathinfo is requested for a file. Probably it should
>>                     return a list of shards, each one with all its associated
>>                     pathinfo. We are talking about big amounts of data here.
>>
>>                     I think this kind of information doesn't fit very well in
>>                     an extended attribute. Another thing to consider is that
>>                     most probably the requester of the data only needs a
>>                     fragment of it, so we are generating big amounts of data
>>                     only to be parsed and reduced later, discarding most of
>>                     it.
>>
>>                     What do you think about using a very special virtual file
>>                     to manage all this information? It could be easily read
>>                     using normal read fops, so it could manage big amounts of
>>                     data easily. Also, by accessing only some parts of the
>>                     file we could go directly where we want, avoiding the
>>                     read of all remaining data.
>>
>>                     A very basic idea could be this:
>>
>>                     Each xlator would have a reserved area of the file. We
>>                     can reserve up to 4GB per xlator (32 bits). The remaining
>>                     32 bits of the offset would indicate the xlator we want
>>                     to access.
>>
>>                     At offset 0 we have generic information about the volume.
>>                     One of the things that this information should include is
>>                     a basic hierarchy of the whole volume and the offset for
>>                     each xlator.
>>
>>                     After reading this, the user will seek to the desired
>>                     offset and read the information related to the xlator it
>>                     is interested in.
>>
>>                     All the information should be stored in an easily
>>                     extensible format that will be kept compatible even if
>>                     new information is added in the future (for example doing
>>                     special mappings of the 32 bits offsets reserved for the
>>                     xlator).
>>
>>                     For example we can reserve the first megabyte of the
>>                     xlator area to have a mapping of attributes with their
>>                     respective offsets.
>>
>>                     I think that using a binary format would simplify all
>>                     this a lot.
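
The offset split described above (high 32 bits select the xlator, low 32 bits address its 4GB area) can be sketched as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and the sketch assumes xlator index 0 is the area that starts at offset 0 and holds the generic volume information.

```python
XLATOR_AREA_BITS = 32
XLATOR_AREA_SIZE = 1 << XLATOR_AREA_BITS  # 4GB reserved per xlator
MAX_XLATORS = 1 << 32                     # remaining 32 bits of a 64-bit offset


def make_offset(xlator_index: int, local_offset: int) -> int:
    """Combine an xlator index and an offset inside its area into a file offset."""
    if not 0 <= xlator_index < MAX_XLATORS:
        raise ValueError("xlator index does not fit in 32 bits")
    if not 0 <= local_offset < XLATOR_AREA_SIZE:
        raise ValueError("local offset does not fit in the 4GB area")
    return (xlator_index << XLATOR_AREA_BITS) | local_offset


def split_offset(offset: int) -> tuple[int, int]:
    """Inverse of make_offset: return (xlator_index, local_offset)."""
    return offset >> XLATOR_AREA_BITS, offset & (XLATOR_AREA_SIZE - 1)


if __name__ == "__main__":
    off = make_offset(3, 0x100)
    print(hex(off), split_offset(off))
```

A reader would first read the area at offset 0 to learn which index belongs to which xlator, then seek to `make_offset(index, 0)` for the one it cares about.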
>>
>>                     Do you think this is a way to explore or should I stop
>>                     wasting time here?
>>
>>
>>                 I think this just became a very big feature :-). Shall we
>>                 just live with it the way it is now?
>>
>>
>>             I supposed so...
>>
>>             Only thing we need to check is if shard needs to handle this
>>             xattr. If so, what should it return? Only the UUIDs
>>             corresponding to the first shard or the UUIDs of all bricks
>>             containing at least one shard? I guess that the first one is
>>             enough, but just to be sure...
>>
>>             My proposal was to implement a new xattr, for example
>>             glusterfs.layout, that contains enough information to be usable
>>             in all current use cases.
>>
>>
>>         Actually pathinfo is supposed to give this information and it
>>         already has the following format, shown here for a 5x2
>>         distributed-replicate volume:
>>
>>
>>     Yes, I know. I wanted to unify all information.
>>
>>
>>         root at dhcp35-190 - /mnt/v3
>>         13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
>>         # file: d
>>         trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
>>         (<REPLICATE:v3-replicate-0>
>>         <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
>>         <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
>>         (<REPLICATE:v3-replicate-2>
>>         <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
>>         <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
>>         (<REPLICATE:v3-replicate-1>
>>         <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
>>         <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
>>         (<REPLICATE:v3-replicate-4>
>>         <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
>>         <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
>>         (<REPLICATE:v3-replicate-3>
>>         <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
>>         <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
>>         (v3-dht-layout
>>         (v3-replicate-0 0 858993458)
>>         (v3-replicate-1 858993459 1717986917)
>>         (v3-replicate-2 1717986918 2576980376)
>>         (v3-replicate-3 2576980377 3435973835)
>>         (v3-replicate-4 3435973836 4294967295)))"
>>
>>
>>         root at dhcp35-190 - /mnt/v3
>>         13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
>>         # file: d/a
>>         trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
>>         (<REPLICATE:v3-replicate-1>
>>         <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
>>         <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
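
For scripts that consume the pathinfo output quoted above, the brick entries can be pulled out with a regular expression. This is a sketch based only on the `<POSIX(brick-root):host:path>` shape visible in that output; it assumes host names contain no ':' (so it would not cope with bare IPv6 addresses), and `parse_pathinfo_bricks` is a name invented for this example.

```python
import re

# Matches entries shaped like <POSIX(/brick/root):hostname:/brick/root/file>
POSIX_ENTRY = re.compile(r"<POSIX\(([^)]*)\):([^:>]+):([^>]+)>")


def parse_pathinfo_bricks(pathinfo: str) -> list[tuple[str, str, str]]:
    """Return (brick_root, host, backend_path) for every POSIX entry found."""
    return POSIX_ENTRY.findall(pathinfo)


if __name__ == "__main__":
    sample = (
        "(<DISTRIBUTE:v3-dht> (<REPLICATE:v3-replicate-1> "
        "<POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a> "
        "<POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
    )
    for root, host, path in parse_pathinfo_bricks(sample):
        print(host, root, path)
```

Note that this flattens the hierarchy: it tells you where the copies live but not which replicate or distribute subvolume each one belongs to, which is part of what the format discussion in this thread is about.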
>>
>>
>>
>>
>>             The idea would be that each xlator that makes a significant
>>             change in the way or the place where files are stored should
>>             put information in this xattr. The information should include:
>>
>>             * Type (basically AFR, EC, DHT, ...)
>>             * Basic configuration (replication and arbiter for AFR, data
>>             and redundancy for EC, # subvolumes for DHT, shard size for
>>             sharding, ...)
>>             * Quorum imposed by the xlator
>>             * UUID data coming from subvolumes (sorted by brick position)
>>             * It should be easily extensible in the future
>>
>>             The last point is very important to avoid the issues we have
>>             seen now. We must be able to incorporate more information
>>             without breaking backward compatibility. To do so, we can add
>>             tags for each value.
>>
>>             For example, a distribute 2, replica 2 volume with 1 arbiter
>>             should be represented by this string:
>>
>>                DHT[dist=2,quorum=1](
>>                   AFR[rep=2,arbiter=1,quorum=2](
>>                      NODE[quorum=2,uuid=<UUID1>](<path1>),
>>                      NODE[quorum=2,uuid=<UUID2>](<path2>),
>>                      NODE[quorum=2,uuid=<UUID3>](<path3>)
>>                   ),
>>                   AFR[rep=2,arbiter=1,quorum=2](
>>                      NODE[quorum=2,uuid=<UUID4>](<path4>),
>>                      NODE[quorum=2,uuid=<UUID5>](<path5>),
>>                      NODE[quorum=2,uuid=<UUID6>](<path6>)
>>                   )
>>                )
>>
>
Yes, this looks simpler for now.

>
>>             Some explanations:
>>
>>             AFAIK DHT doesn't have quorum, so the default is '1'. We may
>>             decide to omit it when it's '1' for any xlator.
>>
>>             Quorum in AFR represents client-side enforced quorum. Quorum in
>>             NODE represents the server-side enforced quorum.
>>
>>             The <path> shown in each NODE represents the physical location
>>             of the file (similar to current glusterfs.pathinfo) because
>>             this xattr can be retrieved for a particular file using
>>             getxattr. This is nice, but we can remove it for now if it's
>>             difficult to implement.
>>
>>             We can decide to have a verbose string or try to omit some
>>             fields when not strictly necessary. For example, if there are
>>             no arbiters, we can omit the 'arbiter' tag instead of writing
>>             'arbiter=0'. We could also implicitly compute 'dist' and 'rep'
>>             from the number of elements contained between '()'.
>>
>>             What do you think ?
>>
>>
>>         Quite a few people are already familiar with path-info. So I am of
>>         the opinion that we give this information for that xattr itself.
>>         This xattr hasn't changed after quorum/arbiter/shard came in, so
>>         maybe it should?
>>
>>
>>     Not sure how easy it would be to change the format of path-info to
>>     incorporate the new information without breaking existing features
>>     or even user scripts based on it. Maybe a new xattr would be easier
>>     to implement and adapt.
>>
>>
>> Probably.
>>
>>
>>
>>     I missed one important thing in the format: an xlator may have
>>     per-subvolume information. This information can be placed just
>>     before each subvolume information:
>>
>>        DHT[dist=2,quorum=1](
>>           [hash-range=0x00000000-0x7fffffff]AFR[...](...),
>>           [hash-range=0x80000000-0xffffffff]AFR[...](...)
>>        )
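
The per-subvolume `hash-range` tags in the example above can be generated mechanically. The sketch below assumes the 32-bit hash space is simply split into equal contiguous chunks, with the last subvolume absorbing the remainder; that even split is an assumption of this example, although it does reproduce both the two-way ranges shown here and the five ranges in the v3-dht-layout output quoted earlier. The function names are hypothetical.

```python
HASH_SPACE = 1 << 32  # 32-bit hash space, 0x00000000 through 0xffffffff


def even_hash_ranges(subvolumes: int) -> list[tuple[int, int]]:
    """Split the hash space into `subvolumes` contiguous inclusive ranges."""
    if subvolumes <= 0:
        raise ValueError("need at least one subvolume")
    chunk = HASH_SPACE // subvolumes
    ranges = []
    for i in range(subvolumes):
        start = i * chunk
        # The last range takes whatever is left so the whole space is covered.
        end = HASH_SPACE - 1 if i == subvolumes - 1 else start + chunk - 1
        ranges.append((start, end))
    return ranges


def format_range(start: int, end: int) -> str:
    """Render one range as the per-subvolume tag used in the example above."""
    return f"[hash-range=0x{start:08x}-0x{end:08x}]"


if __name__ == "__main__":
    for start, end in even_hash_ranges(2):
        print(format_range(start, end) + "AFR[...](...)")
```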
>>
>>
>> Yes, makes sense.
>>
>> In general I am better at solving problems someone faces, because things
>> will be more concrete. Do you think it is better to wait until the first
>> consumer of this functionality comes along and gives their inputs about
>> what would be nice to have vs must have? At the moment I am not sure how
>> to distinguish what must be there vs what is nice to have :-(.
>>
>
> The good thing is that using this format we can easily start with bare
> minimum information, like this:
>
>    DHT(
>       AFR(
>          NODE[uuid=<UUID1>],
>          NODE[uuid=<UUID2>],
>          NODE[uuid=<UUID3>]
>       ),
>       AFR(
>          NODE[uuid=<UUID1>],
>          NODE[uuid=<UUID2>],
>          NODE[uuid=<UUID3>]
>       )
>    )
>
> And add more information as it is needed, since it won't break backward
> compatibility.
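
To make the extensibility argument concrete, here is a minimal sketch of a reader for the `TYPE[tag=value,...](child, ...)` notation proposed above. It is an illustration under assumptions, not an agreed grammar: whitespace is treated as insignificant, tag values may not contain ',' or ']', and the `<path>` leaves and per-subvolume `[hash-range=...]` prefixes from the fuller examples are not handled. `parse_layout` and its helpers are hypothetical names.

```python
import re

_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def parse_layout(text: str) -> dict:
    """Parse TYPE[tag=value,...](child, ...) into nested dictionaries."""
    text = re.sub(r"\s+", "", text)  # assumption: whitespace carries no meaning
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return node


def _parse_tags(text: str, pos: int) -> tuple[dict, int]:
    """Parse a [k=v,...] block starting at text[pos] == '['."""
    end = text.index("]", pos)  # raises ValueError if the block is unterminated
    tags = {}
    body = text[pos + 1:end]
    if body:
        for item in body.split(","):
            key, _, value = item.partition("=")
            tags[key] = value  # unknown tags are kept, never rejected
    return tags, end + 1


def _parse_node(text: str, pos: int) -> tuple[dict, int]:
    match = _NAME.match(text, pos)
    if match is None:
        raise ValueError(f"expected an xlator type at position {pos}")
    node = {"type": match.group(), "tags": {}, "children": []}
    pos = match.end()
    if pos < len(text) and text[pos] == "[":
        node["tags"], pos = _parse_tags(text, pos)
    if pos < len(text) and text[pos] == "(":
        pos += 1
        while True:
            child, pos = _parse_node(text, pos)
            node["children"].append(child)
            if pos >= len(text):
                raise ValueError("unterminated child list")
            if text[pos] == ",":
                pos += 1
            elif text[pos] == ")":
                pos += 1
                break
            else:
                raise ValueError(f"unexpected character at position {pos}")
    return node, pos


if __name__ == "__main__":
    tree = parse_layout("DHT(AFR(NODE[uuid=<UUID1>],NODE[uuid=<UUID2>]))")
    print([n["tags"]["uuid"] for n in tree["children"][0]["children"]])
```

Because tags are stored by name and anything unrecognised is simply carried along, a consumer written against the bare-minimum form keeps working when later versions add tags such as `quorum` or `arbiter`.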
>
> Xavi
>
>
>>
>>     Xavi
>>
>>
>>
>>
>>             Xavi
>>
>>
>>
>>
>>                     Xavi
>>
>>
>>
>>
>>                         On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
>>                         <ksubrahm at redhat.com> wrote:
>>
>>
>>
>>                             On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
>>                             <xhernandez at datalab.es> wrote:
>>
>>                                 That's ok. I'm currently unable to write a
>>                                 patch for this on ec.
>>
>>                             Sunil is working on this patch.
>>
>>                             ~Karthik
>>
>>                                 If no one can do it, I can try to do it in
>>                                 6 - 7 hours...
>>
>>                                 Xavi
>>
>>
>>                                 On Wednesday, June 21, 2017 09:48 CEST, Pranith
>>                                 Kumar Karampuri <pkarampu at redhat.com> wrote:
>>
>>
>>
>>                                     On Wed, Jun 21, 2017 at 1:00 PM, Xavier
>>                                     Hernandez <xhernandez at datalab.es> wrote:
>>
>>                                         I'm ok with reverting node-uuid content to
>>                                         the previous format and creating a new xattr
>>                                         for the new format. Currently, only
>>                                         rebalance will use it.
>>
>>                                         Only thing to consider is what can happen if
>>                                         we have a half upgraded cluster where some
>>                                         clients have this change and some do not.
>>                                         Can rebalance work in this situation? If so,
>>                                         could there be any issue?
>>
>>
>>                                     I think there shouldn't be any problem,
>>                                     because this is an in-memory xattr, so layers
>>                                     below afr/ec will only see the node-uuid xattr.
>>                                     This also gives us a chance to do whatever we
>>                                     want to do in future with this xattr without
>>                                     any problems about backward compatibility.
>>
>>                                     You can check
>>                                     https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>>                                     for how Karthik implemented this in AFR (this
>>                                     got merged accidentally yesterday, but it looks
>>                                     like this is what we are settling on)
>>
>>
>>
>>                                         Xavi
>>
>>
>>                                         On Wednesday, June 21, 2017 06:56 CEST,
>>                                         Pranith Kumar Karampuri
>>                                         <pkarampu at redhat.com> wrote:
>>
>>
>>
>>                                             On Wed, Jun 21, 2017 at 10:07 AM, Nithya
>>                                             Balachandran <nbalacha at redhat.com> wrote:
>>
>>
>>                                                 On 20 June 2017 at 20:38, Aravinda
>>                                                 <avishwan at redhat.com> wrote:
>>
>>                                                     On 06/20/2017 06:02 PM, Pranith
>>                                                     Kumar Karampuri wrote:
>>
>>                                                         Xavi, Aravinda and I had a discussion
>>                                                         on #gluster-dev and we agreed to go
>>                                                         with the format Aravinda suggested for
>>                                                         now. In future we want some more
>>                                                         changes for dht to detect which
>>                                                         subvolume went down and came back up;
>>                                                         at that time we will revisit the
>>                                                         solution suggested by Xavi.
>>
>>                                                         Susanth is doing the dht changes
>>                                                         Aravinda is doing geo-rep changes
>>
>>                                                     Done. Geo-rep patch sent for review
>>
>>                                                     https://review.gluster.org/17582
>>
>>
>>
>>                                                 The proposed changes to the node-uuid
>>                                                 behaviour (while good) are going to break
>>                                                 tiering. Tiering changes will take a
>>                                                 little more time to be coded and tested.
>>
>>                                                 As this is a regression for 3.11 and a
>>                                                 blocker for 3.11.1, I suggest we go back
>>                                                 to the original node-uuid behaviour for
>>                                                 now so as to unblock the release and
>>                                                 target the proposed changes for the next
>>                                                 3.11 releases.
>>
>>
>>                                             Let me see if I understand the changes
>>                                             correctly. We are restoring the behavior of
>>                                             node-uuid xattr and adding a new xattr for
>>                                             parallel rebalance for both afr and ec,
>>                                             correct? Otherwise that is one more
>>                                             regression. If yes, we will also wait for
>>                                             Xavi's inputs. Jeff accidentally merged the
>>                                             afr patch yesterday which does these
>>                                             changes. If everyone is in agreement, we
>>                                             will leave it as is and add similar changes
>>                                             in ec as well. If we are not in agreement,
>>                                             then we will let the discussion progress :-)
>>
>>
>>
>>
>>                                                 Regards,
>>                                                 Nithya
>>
>>                                                     --
>>                                                     Aravinda
>>
>>
>>                                                         Thanks to all of you guys for the
>>                                                         discussions!
>>
>>                                                         On Tue, Jun 20, 2017 at 5:05 PM, Xavier
>>                                                         Hernandez <xhernandez at datalab.es> wrote:
>>
>>                                                             Hi Aravinda,
>>
>>                                                             On 20/06/17 12:42, Aravinda wrote:
>>
>>                                                                 I think the following format can be
>>                                                                 easily adopted by all components
>>
>>                                                                 UUIDs of a subvolume are separated by
>>                                                                 space and subvolumes are separated by
>>                                                                 comma
>>
>>                                                                 For example, node1 and node2 are
>>                                                                 replica with U1 and U2 UUIDs
>>                                                                 respectively and node3 and node4 are
>>                                                                 replica with U3 and U4 UUIDs
>>                                                                 respectively
>>
>>                                                                 node-uuid can return "U1 U2,U3 U4"
>>
>>
>>                                                             While this is ok for current implementation, I think this
>>                                                             can be insufficient if there are more layers of xlators
>>                                                             that require to indicate some sort of grouping. Some
>>                                                             representation that can represent hierarchy would be
>>                                                             better. For example: "(U1 U2) (U3 U4)" (we can use spaces
>>                                                             or comma as a separator).
>>
>>
>>
>>                                                                 Geo-rep can split by "," and then split by space and take
>>                                                                 first UUID
>>                                                                 DHT can split the value by space or comma and get unique
>>                                                                 UUIDs list
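
To make the flat format quoted above concrete, here is a minimal Python
sketch of the two consumers it describes. The function names are
hypothetical (they are not taken from the geo-rep or DHT code), and the
input is assumed to be exactly the "U1 U2,U3 U4" layout from the example.

```python
def parse_node_uuid(value):
    """Split a flat node-uuid value into one UUID list per subvolume.

    Assumes the proposed layout: UUIDs of a subvolume separated by
    spaces, subvolumes separated by commas.
    """
    return [group.split() for group in value.split(",") if group.strip()]


def georep_first_uuids(value):
    """Geo-rep view: split by ',' then by space, take the first UUID."""
    return [uuids[0] for uuids in parse_node_uuid(value)]


def dht_unique_uuids(value):
    """DHT view: every UUID once, in order of first appearance."""
    unique = []
    for uuids in parse_node_uuid(value):
        for uuid in uuids:
            if uuid not in unique:
                unique.append(uuid)
    return unique


if __name__ == "__main__":
    value = "U1 U2,U3 U4"
    print(georep_first_uuids(value))  # ['U1', 'U3']
    print(dht_unique_uuids(value))    # ['U1', 'U2', 'U3', 'U4']
```

Note that this only picks one candidate per subvolume from the string
itself; it does not address which node of a replica set ends up active.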
>>
>>
>>                                                             This doesn't solve the problem I described in the previous
>>                                                             email. Some more logic will need to be added to avoid more
>>                                                             than one node from each replica-set to be active. If we
>>                                                             have some explicit hierarchy information in the node-uuid
>>                                                             value, more decisions can be taken.
>>
>>                                                             An initial proposal I made was this:
>>
>>                                                             DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
>>                                                                    AFR[2,0](NODE(U1), NODE(U2)))
>>
>>                                                             This is harder to parse, but gives a lot of information:
>>                                                             DHT with 2 subvolumes, each subvolume is an AFR with
>>                                                             replica 2 and no arbiters. It's also easily extensible
>>                                                             with any new xlator that changes the layout.
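
As a rough illustration of how much parsing the hierarchical form would
need, here is a small recursive-descent sketch in Python. The grammar is
an assumption inferred from the single example quoted above (a name,
optional integer parameters in "[...]", optional children in "(...)");
the helper names are hypothetical, and the sample string uses U3/U4 for
the second replica pair so the two sets are distinguishable.

```python
import re

# A token is either an identifier/number or one of the punctuation marks.
TOKEN = re.compile(r"\s*([A-Za-z0-9_-]+|[\[\](),])")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens, i=0):
    """Parse one element starting at tokens[i]; return (node, next_index)."""
    node = {"name": tokens[i], "params": [], "children": []}
    i += 1
    if i < len(tokens) and tokens[i] == "[":
        i += 1
        while tokens[i] != "]":
            if tokens[i] != ",":
                node["params"].append(int(tokens[i]))
            i += 1
        i += 1  # skip "]"
    if i < len(tokens) and tokens[i] == "(":
        i += 1
        while tokens[i] != ")":
            if tokens[i] == ",":
                i += 1
                continue
            child, i = parse(tokens, i)
            node["children"].append(child)
        i += 1  # skip ")"
    return node, i


def replica_sets(node):
    """Collect the UUIDs under each AFR element, one list per replica set."""
    if node["name"] == "AFR":
        return [[n["children"][0]["name"] for n in node["children"]]]
    sets = []
    for child in node["children"]:
        sets.extend(replica_sets(child))
    return sets


if __name__ == "__main__":
    text = "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))"
    tree, _ = parse(tokenize(text))
    print(tree["name"], tree["params"])  # DHT [2]
    print(replica_sets(tree))            # [['U1', 'U2'], ['U3', 'U4']]
```

With the grouping explicit, a consumer can pick one UUID per replica set
without knowing in advance how many xlator layers sit above it.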
>>
>>                                                             However maybe this is not the moment to do this, and
>>                                                             probably we could implement this in a new xattr with a
>>                                                             better name.
>>
>>                                                             Xavi
>>
>>
>>
>>                                                                 Another question is about the behavior when a node is
>>                                                                 down, existing node-uuid xattr will not return that UUID
>>                                                                 if a node is down. What is the behavior with the proposed
>>                                                                 xattr?
>>
>>                                                                 Let me know your thoughts.
>>
>>                                                                 regards
>>                                                                 Aravinda VK
>>
>>                                                                 On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>>                                                                     Hi Xavi,
>>
>>                                                                     On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>
>>                                                                         Hi Aravinda,
>>
>>                                                                         On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>
>>                                                                             Adding more people to get a consensus about this.
>>
>>                                                                             On Tue, Jun 20, 2017 at 1:49 PM, Aravinda
>>                                                                             <avishwan at redhat.com> wrote:
>>
>>
>>
>>                 regards
>>
>>                 Aravinda VK
>>
>>
>>
>>                 On
>>                                     06/20/2017 01:26 PM,
>
>

-- 
Pranith