[Gluster-devel] geo-rep regression because of node-uuid change

Xavier Hernandez xhernandez at datalab.es
Fri Jul 7 09:35:15 UTC 2017


On 07/07/17 11:25, Pranith Kumar Karampuri wrote:
>
>
> On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
>     On 07/07/17 10:12, Pranith Kumar Karampuri wrote:
>
>
>
>         On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
>         <xhernandez at datalab.es> wrote:
>
>             Hi Pranith,
>
>             On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
>
>
>
>                 On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
>                 <xhernandez at datalab.es> wrote:
>
>                     Hi Pranith,
>
>                     On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>
>                         Xavi,
>                         Now that the change has been reverted, we can
>                         resume this discussion and decide on the exact
>                         format that considers tier, dht, afr and ec.
>                         People working on geo-rep/dht/afr/ec had an
>                         internal discussion and we all agreed that this
>                         proposal would be a good way forward. I think
>                         once we agree on the format and decide on the
>                         initial encoding/decoding functions of the
>                         xattr, and this change is merged, we can send
>                         patches on afr/ec/dht and geo-rep to take it to
>                         closure.
>
>                         Could you propose the new format you have in
>                         mind that considers all of the xlators?
>
>
>                     My idea was to create a new xattr not bound to any
>                     particular function but which could give enough
>                     information to be used in many places.
>
>                     Currently we have another attribute called
>                     glusterfs.pathinfo that returns hierarchical
>                     information about the location of a file. Maybe we
>                     can extend this to unify all these attributes into
>                     a single feature that could be used for multiple
>                     purposes.
>
>                     Since we have time to discuss it, I would like to
>                     design it with more information than we have
>                     already talked about.
>
>                     First of all, the amount of information that this
>                     attribute can contain is quite big if we expect to
>                     have volumes with thousands of bricks. Even in the
>                     simplest case of returning only a UUID, we can
>                     easily go beyond the limit of 64KB.
>
>                     Consider also, for example, what shard should
>                     return when pathinfo is requested for a file.
>                     Probably it should return a list of shards, each
>                     one with all its associated pathinfo. We are
>                     talking about big amounts of data here.
>
>                     I think this kind of information doesn't fit very
>                     well in an extended attribute. Another thing to
>                     consider is that most probably the requester of
>                     the data only needs a fragment of it, so we are
>                     generating big amounts of data only to be parsed
>                     and reduced later, dismissing most of it.
>
>                     What do you think about using a very special
>                     virtual file to manage all this information? It
>                     could be easily read using normal read fops, so it
>                     could manage big amounts of data easily. Also,
>                     accessing only some parts of the file, we could go
>                     directly where we want, avoiding the read of all
>                     the remaining data.
>
>                     A very basic idea could be this:
>
>                     Each xlator would have a reserved area of the
>                     file. We can reserve up to 4GB per xlator (32
>                     bits). The remaining 32 bits of the offset would
>                     indicate the xlator we want to access.
>
>                     At offset 0 we have generic information about the
>                     volume. One of the things that this information
>                     should include is a basic hierarchy of the whole
>                     volume and the offset for each xlator.
>
>                     After reading this, the user will seek to the
>                     desired offset and read the information related to
>                     the xlator it is interested in.
>
>                     All the information should be stored in an easily
>                     extensible format that will be kept compatible
>                     even if new information is added in the future
>                     (for example doing special mappings of the 32-bit
>                     offsets reserved for the xlator).
>
>                     For example, we can reserve the first megabyte of
>                     the xlator area to have a mapping of attributes
>                     with their respective offsets.
>
>                     I think that using a binary format would simplify
>                     all this a lot.
>
>                     Do you think this is a way worth exploring, or
>                     should I stop wasting time here?
>
>
>                 I think this just became a very big feature :-). Shall
>                 we just live with it the way it is now?
>
>
>             I suspected so...
>
>             The only thing we need to check is whether shard needs to
>             handle this xattr. If so, what should it return? Only the
>             UUIDs corresponding to the first shard, or the UUIDs of
>             all bricks containing at least one shard? I guess the
>             first one is enough, but just to be sure...
>
>             My proposal was to implement a new xattr, for example
>             glusterfs.layout, that contains enough information to be
>             usable in all current use cases.
>
>
>         Actually pathinfo is supposed to give this information, and
>         it already has the following format for a 5x2
>         distributed-replicate volume:
>
>
>     Yes, I know. I wanted to unify all the information.
>
>
>         root at dhcp35-190 - /mnt/v3
>         13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
>         # file: d
>         trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
>         (<REPLICATE:v3-replicate-0>
>         <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
>         <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
>         (<REPLICATE:v3-replicate-2>
>         <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
>         <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
>         (<REPLICATE:v3-replicate-1>
>         <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
>         <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
>         (<REPLICATE:v3-replicate-4>
>         <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
>         <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
>         (<REPLICATE:v3-replicate-3>
>         <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
>         <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
>         (v3-dht-layout (v3-replicate-0 0 858993458) (v3-replicate-1
>         858993459
>         1717986917) (v3-replicate-2 1717986918 2576980376) (v3-replicate-3
>         2576980377 3435973835) (v3-replicate-4 3435973836 4294967295)))"
>
>
>         root at dhcp35-190 - /mnt/v3
>         13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
>         # file: d/a
>         trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
>         (<REPLICATE:v3-replicate-1>
>         <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
>         <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
>
>
>
>
>             The idea would be that each xlator that makes a
>             significant change in the way or the place where files
>             are stored should put information in this xattr. The
>             information should include:
>
>             * Type (basically AFR, EC, DHT, ...)
>             * Basic configuration (replication and arbiter for AFR,
>               data and redundancy for EC, # subvolumes for DHT, shard
>               size for sharding, ...)
>             * Quorum imposed by the xlator
>             * UUID data coming from subvolumes (sorted by brick
>               position)
>             * It should be easily extensible in the future
>
>             The last point is very important to avoid the issues we
>             have seen now. We must be able to incorporate more
>             information without breaking backward compatibility. To
>             do so, we can add tags for each value.
>
>             For example, a distribute 2, replica 2 volume with 1
>             arbiter should be represented by this string:
>
>                DHT[dist=2,quorum=1](
>                   AFR[rep=2,arbiter=1,quorum=2](
>                      NODE[quorum=2,uuid=<UUID1>](<path1>),
>                      NODE[quorum=2,uuid=<UUID2>](<path2>),
>                      NODE[quorum=2,uuid=<UUID3>](<path3>)
>                   ),
>                   AFR[rep=2,arbiter=1,quorum=2](
>                      NODE[quorum=2,uuid=<UUID4>](<path4>),
>                      NODE[quorum=2,uuid=<UUID5>](<path5>),
>                      NODE[quorum=2,uuid=<UUID6>](<path6>)
>                   )
>                )
>
>             Some explanations:
>
>             AFAIK DHT doesn't have quorum, so the default is '1'. We
>             may decide to omit it when it's '1' for any xlator.
>
>             Quorum in AFR represents the client-side enforced quorum.
>             Quorum in NODE represents the server-side enforced
>             quorum.
>
>             The <path> shown in each NODE represents the physical
>             location of the file (similar to the current
>             glusterfs.pathinfo) because this xattr can be retrieved
>             for a particular file using getxattr. This is nice, but
>             we can remove it for now if it's difficult to implement.
>
>             We can decide to have a verbose string or try to omit
>             some fields when not strictly necessary. For example, if
>             there are no arbiters, we can omit the 'arbiter' tag
>             instead of writing 'arbiter=0'. We could also implicitly
>             compute 'dist' and 'rep' from the number of elements
>             contained between '()'.
>
>             What do you think?
>
>
>         Quite a few people are already familiar with path-info. So I
>         am of the opinion that we give this information for that
>         xattr itself. This xattr hasn't changed after
>         quorum/arbiter/shard came in, so maybe they should?
>
>
>     Not sure how easy it would be to change the format of path-info
>     to incorporate the new information without breaking existing
>     features or even user scripts based on it. Maybe a new xattr
>     would be easier to implement and adapt.
>
>
> Probably.
>
>
>
>     I missed one important thing in the format: an xlator may have
>     per-subvolume information. This information can be placed just
>     before each subvolume's information:
>
>        DHT[dist=2,quorum=1](
>           [hash-range=0x00000000-0x7fffffff]AFR[...](...),
>           [hash-range=0x80000000-0xffffffff]AFR[...](...)
>        )
>
>
> Yes, makes sense.
>
> In general I am better at solving problems someone faces, because things
> will be more concrete. Do you think it is better to wait until the first
> consumer of this functionality comes along and gives their inputs about
> what would be nice to have vs must have? At the moment I am not sure how
> to distinguish what must be there vs what is nice to have :-(.

The good thing is that using this format we can easily start with the
bare minimum information, like this:

    DHT(
       AFR(
          NODE[uuid=<UUID1>],
          NODE[uuid=<UUID2>],
          NODE[uuid=<UUID3>]
       ),
       AFR(
          NODE[uuid=<UUID1>],
          NODE[uuid=<UUID2>],
          NODE[uuid=<UUID3>]
       )
    )

And add more information as it is needed, since it won't break backward 
compatibility.
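For what it's worth, a consumer of this xattr (geo-rep, scripts, ...) could handle it with a tiny recursive-descent parser. The sketch below is purely illustrative Python with hypothetical names (parse_layout and friends are not anything that exists in Gluster); the point is that tags a parser doesn't understand are carried along as opaque key=value pairs, which is exactly what keeps the format backward compatible:

```python
# Illustrative sketch only: parse the proposed layout string, e.g.
#   DHT(AFR(NODE[uuid=U1],NODE[uuid=U2]),AFR(NODE[uuid=U3],NODE[uuid=U4]))
# into a tree of {name, attrs, children}. Unknown tags are kept as
# opaque key=value pairs, so adding fields later doesn't break old parsers.
import re

_TOKEN = re.compile(r'\s*([A-Za-z0-9_<>.-]+|[\[\]\(\),=])')

def _tokenize(s):
    pos, tokens = 0, []
    while pos < len(s):
        m = _TOKEN.match(s, pos)
        if not m:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def _parse(tokens, i):
    node = {'name': tokens[i], 'attrs': {}, 'children': []}
    i += 1
    if i < len(tokens) and tokens[i] == '[':       # optional [key=value,...]
        i += 1
        while tokens[i] != ']':
            node['attrs'][tokens[i]] = tokens[i + 2]   # key '=' value
            i += 3
            if tokens[i] == ',':
                i += 1
        i += 1
    if i < len(tokens) and tokens[i] == '(':       # optional (child,child,...)
        i += 1
        while tokens[i] != ')':
            child, i = _parse(tokens, i)
            node['children'].append(child)
            if tokens[i] == ',':
                i += 1
        i += 1
    return node, i

def parse_layout(s):
    node, _ = _parse(_tokenize(s.strip()), 0)
    return node
```

With this, geo-rep could take the first NODE uuid under each subvolume, and DHT could collect the unique uuids, without either of them needing to know about tags added in the future.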

Xavi

>
>
>     Xavi
>
>
>
>
>             Xavi
>
>
>
>
>                     Xavi
>
>
>
>
>                         On Wed, Jun 21, 2017 at 2:08 PM, Karthik
>                         Subrahmanya <ksubrahm at redhat.com> wrote:
>
>
>
>                             On Wed, Jun 21, 2017 at 1:56 PM, Xavier
>                             Hernandez <xhernandez at datalab.es> wrote:
>
>                                 That's ok. I'm currently unable to
>                                 write a patch for this on ec.
>
>                             Sunil is working on this patch.
>
>                             ~Karthik
>
>                                 If no one can do it, I can try to do
>                                 it in 6-7 hours...
>
>                                 Xavi
>
>
>                                 On Wednesday, June 21, 2017 09:48
>                                 CEST, Pranith Kumar Karampuri
>                                 <pkarampu at redhat.com> wrote:
>
>
>
>                                     On Wed, Jun 21, 2017 at 1:00 PM,
>                                     Xavier Hernandez
>                                     <xhernandez at datalab.es> wrote:
>
>                                         I'm ok with reverting the
>                                         node-uuid content to the
>                                         previous format and creating
>                                         a new xattr for the new
>                                         format. Currently, only
>                                         rebalance will use it.
>
>                                         The only thing to consider is
>                                         what can happen if we have a
>                                         half-upgraded cluster where
>                                         some clients have this change
>                                         and some don't. Can rebalance
>                                         work in this situation? If
>                                         so, could there be any issue?
>
>
>                                     I think there shouldn't be any
>                                     problem, because this is an
>                                     in-memory xattr, so layers below
>                                     afr/ec will only see the
>                                     node-uuid xattr. This also gives
>                                     us a chance to do whatever we
>                                     want to do in the future with
>                                     this xattr without any problems
>                                     about backward compatibility.
>
>                                     You can check
>         https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>                                     for how Karthik implemented this
>                                     in AFR (this got merged
>                                     accidentally yesterday, but it
>                                     looks like this is what we are
>                                     settling on).
>
>
>
>                                         Xavi
>
>
>                                         On Wednesday, June 21, 2017
>                                         06:56 CEST, Pranith Kumar
>                                         Karampuri
>                                         <pkarampu at redhat.com> wrote:
>
>
>
>                                             On Wed, Jun 21, 2017 at
>                                             10:07 AM, Nithya
>                                             Balachandran
>                                             <nbalacha at redhat.com> wrote:
>
>
>                                                 On 20 June 2017 at
>                                                 20:38, Aravinda
>                                                 <avishwan at redhat.com> wrote:
>
>                                                     On 06/20/2017 06:02 PM, Pranith Kumar
>                                                     Karampuri wrote:
>
>                                                         Xavi, Aravinda and I had a discussion
>                                                         on #gluster-dev and we agreed to go
>                                                         with the format Aravinda suggested
>                                                         for now; in the future we want some
>                                                         more changes for dht to detect which
>                                                         subvolume went down and came back up,
>                                                         at which time we will revisit the
>                                                         solution suggested by Xavi.
>
>                                                         Susanth is doing the dht changes.
>                                                         Aravinda is doing the geo-rep changes.
>
>                                                     Done. Geo-rep patch sent for review:
>                                                     https://review.gluster.org/17582
>
>
>
>                                                 The proposed changes to the node-uuid
>                                                 behaviour (while good) are going to break
>                                                 tiering. Tiering changes will take a
>                                                 little more time to be coded and tested.
>
>                                                 As this is a regression for 3.11 and a
>                                                 blocker for 3.11.1, I suggest we go back
>                                                 to the original node-uuid behaviour for
>                                                 now so as to unblock the release, and
>                                                 target the proposed changes for the next
>                                                 3.11 releases.
>
>
>                                             Let me see if I understand the changes
>                                             correctly. We are restoring the behavior
>                                             of the node-uuid xattr and adding a new
>                                             xattr for parallel rebalance for both afr
>                                             and ec, correct? Otherwise that is one
>                                             more regression. If yes, we will also
>                                             wait for Xavi's inputs. Jeff accidentally
>                                             merged the afr patch yesterday, which
>                                             does these changes. If everyone is in
>                                             agreement, we will leave it as is and add
>                                             similar changes in ec as well. If we are
>                                             not in agreement, then we will let the
>                                             discussion progress :-)
>
>
>
>
>                                                 Regards,
>                                                 Nithya
>
>                                                     --
>                                                     Aravinda
>
>
>                                                         Thanks to all of you guys for the
>                                                         discussions!
>
>                                                         On Tue, Jun 20, 2017 at 5:05 PM,
>                                                         Xavier Hernandez
>                                                         <xhernandez at datalab.es> wrote:
>
>                                                             Hi Aravinda,
>
>                                                             On 20/06/17 12:42, Aravinda wrote:
>
>                                                                 I think the following format
>                                                                 can be easily adopted by all
>                                                                 components:
>
>                                                                 UUIDs of a subvolume are
>                                                                 separated by space, and
>                                                                 subvolumes are separated by
>                                                                 comma.
>
>                                                                 For example, node1 and node2
>                                                                 are a replica with U1 and U2
>                                                                 UUIDs respectively, and
>                                                                 node3 and node4 are a
>                                                                 replica with U3 and U4 UUIDs
>                                                                 respectively:
>
>                                                                 node-uuid can return
>                                                                 "U1 U2,U3 U4"
>
>
> While this is ok for the current implementation, I think it can be
> insufficient if there are more layers of xlators that need to
> indicate some sort of grouping. Some representation that can express
> hierarchy would be better. For example: "(U1 U2) (U3 U4)" (we can
> use spaces or comma as a separator).
>
>
>
>     Geo-rep can split by "," and then split by space and take the
>     first UUID. DHT can split the value by space or comma and get
>     the list of unique UUIDs.
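The two parsing strategies Aravinda describes for the flat "U1 U2,U3 U4" format could be sketched like this (an illustrative Python sketch; `georep_active_uuids` and `dht_unique_uuids` are hypothetical names, not real Gluster functions):

```python
def georep_active_uuids(value):
    """Geo-rep: split by ",", then by space, and take the first
    UUID of each subvolume group."""
    return [group.split()[0] for group in value.split(",") if group.strip()]

def dht_unique_uuids(value):
    """DHT: split on both separators and return the unique UUIDs,
    preserving their order of appearance."""
    seen = set()
    out = []
    for u in value.replace(",", " ").split():
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

print(georep_active_uuids("U1 U2,U3 U4"))  # ['U1', 'U3']
print(dht_unique_uuids("U1 U2,U3 U4"))     # ['U1', 'U2', 'U3', 'U4']
```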
>
>
> This doesn't solve the problem I described in the previous email:
> some more logic will need to be added to avoid more than one node
> from each replica-set becoming active. If we have some explicit
> hierarchy information in the node-uuid value, more decisions can be
> taken.
>
> An initial proposal I made was this:
>
>     DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))
>
> This is harder to parse, but gives a lot of information: a DHT with
> 2 subvolumes, where each subvolume is an AFR with replica 2 and no
> arbiters. It's also easily extensible with any new xlator that
> changes the layout.
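For what it's worth, the hierarchical format above is parseable with a short recursive-descent parser. This is only an illustrative Python sketch of the proposed syntax; the parse-tree representation (nested tuples) is my own choice, not anything in Gluster:

```python
import re

# Matches an xlator head with optional parameters, e.g. "DHT[2](" or "NODE(".
_HEAD = re.compile(r'\s*([A-Z]+)(?:\[([0-9,]+)\])?\(')

def _parse(s, pos):
    m = _HEAD.match(s, pos)
    if not m:
        raise ValueError("expected xlator at position %d" % pos)
    name, params = m.group(1), m.group(2)
    pos = m.end()
    if name == "NODE":
        # Leaf: NODE(<uuid>) -> ("NODE", uuid)
        end = s.index(")", pos)
        return ("NODE", s[pos:end].strip()), end + 1
    children = []
    while True:
        child, pos = _parse(s, pos)
        children.append(child)
        while pos < len(s) and s[pos] == " ":
            pos += 1
        if pos < len(s) and s[pos] == ",":
            pos += 1  # next sibling; leading spaces are eaten by _HEAD
            continue
        if pos < len(s) and s[pos] == ")":
            return (name, params, children), pos + 1
        raise ValueError("expected ',' or ')' at position %d" % pos)

def parse_layout(s):
    tree, _ = _parse(s, 0)
    return tree

tree = parse_layout("DHT[2](AFR[2,0](NODE(U1), NODE(U2)), "
                    "AFR[2,0](NODE(U3), NODE(U4)))")
```

A consumer such as geo-rep could then pick, say, the first NODE of each second-level subtree, while DHT could flatten all leaves.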
>
> However maybe this is not the moment to do this, and probably we
> could implement this in a new xattr with a better name.
>
> Xavi
>
>
>
>     Another question is about the behavior when a node is down: the
>     existing node-uuid xattr will not return that UUID if a node is
>     down. What is the behavior with the proposed xattr?
>
>     Let me know your thoughts.
>
>     regards
>     Aravinda VK
>
>     On 06/20/2017 03:06 PM, Aravinda wrote:
>
>         Hi Xavi,
>
>         On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>
>             Hi Aravinda,
>
>             On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>
>                 Adding more people to get a consensus about this.
>
>                 On Tue, Jun 20, 2017 at 1:49 PM, Aravinda
>                 <avishwan at redhat.com> wrote:
>
>                     regards
>                     Aravinda VK
>
>                     On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
>                         Hi Pranith,
>
>                         adding gluster-devel, Kotresh and Aravinda,
>
>                         On 20/06/17 09:45, Pranith Kumar Karampuri
>                         wrote:
>
>                             On Tue, Jun 20, 2017 at 1:12 PM, Xavier
>                             Hernandez <xhernandez at datalab.es>
>                             wrote:
>
>                                 On 20/06/17 09:31, Pranith Kumar
>                                 Karampuri wrote:
>
>                                     The way geo-replication works
>                                     is: on each machine, it does a
>                                     getxattr of node-uuid and checks
>                                     if its own uuid is present in
>                                     the list. If it is present then
>                                     it will consider itself active,
>                                     otherwise it will be considered
>                                     passive. With this change we are
>                                     giving all uuids instead of the
>                                     first-up subvolume's, so all
>                                     machines think they are ACTIVE,
>                                     which is bad apparently. So that
>                                     is the reason. Even I felt bad
>                                     that we are doing this change.
>
>                                 And what about changing the content
>                                 of node-uuid to include some sort of
>                                 hierarchy?
>
>                                 For example:
>
>                                 a single brick:
>
>                                     NODE(<guid>)
>
>                                 AFR/EC:
>
>                                     AFR[2](NODE(<guid>), NODE(<guid>))
>                                     EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>
>                                 DHT:
>
>                                     DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>                                            AFR[2](NODE(<guid>), NODE(<guid>)))
>
>                                 This gives a lot of information that
>                                 can be used to take the appropriate
>                                 decisions.
>
>                             I guess that is not backward compatible.
>                             Shall I CC gluster-devel and
>                             Kotresh/Aravinda?
>
>                         Is the change we did backward compatible? If
>                         we only require the first field to be a GUID
>                         to support backward compatibility, we can
>                         use something like this:
>
>                     No. But the necessary change can be made to the
>                     Geo-rep code as well if the format is changed,
>                     since all of these are built/shipped together.
>
>                     Geo-rep uses node-uuid as follows:
>
>                         list = listxattr(node-uuid)
>                         active_node_uuids = list.split(SPACE)
>                         active_node_flag = True if self.node_id
>                             exists in active_node_uuids else False
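Made runnable, the decision described above is essentially the following (a sketch; `is_active` is an illustrative name, not geo-rep's actual code):

```python
def is_active(node_uuid_value, my_uuid):
    # A worker considers itself Active iff its own UUID appears in the
    # space-separated node-uuid xattr value.
    return my_uuid in node_uuid_value.split()

# Old behaviour: only the first-up UUID per subvolume is returned,
# so at most one worker per replica-set goes Active.
print(is_active("U1 U3", "U2"))        # False -> Passive
# New behaviour: all UUIDs returned, so every worker goes Active.
print(is_active("U1 U2 U3 U4", "U2"))  # True  -> Active
```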
>
>             How was this case solved?
>
>             Suppose we have three servers and 2 bricks in each
>             server. A replicated volume is created using the
>             following command:
>
>             gluster volume create test replica 2 server1:/brick1
>             server2:/brick1 server2:/brick2 server3:/brick1
>             server3:/brick2 server1:/brick2
>
>             In this case we have three replica-sets:
>
>             * server1:/brick1 server2:/brick1
>             * server2:/brick2 server3:/brick1
>             * server3:/brick2 server1:/brick2
>
>             The old AFR implementation of node-uuid always returned
>             the uuid of the node of the first brick, so in this case
>             we will get the uuids of all three nodes because each of
>             them hosts the first brick of a replica-set.
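The effect can be illustrated with a small sketch (UUIDs U1..U3 for server1..server3 are hypothetical placeholders; old-AFR semantics assumed as described above):

```python
# Bricks in creation order; consecutive pairs form replica-sets (replica 2).
bricks = [
    ("server1", "/brick1"), ("server2", "/brick1"),
    ("server2", "/brick2"), ("server3", "/brick1"),
    ("server3", "/brick2"), ("server1", "/brick2"),
]
uuid_of = {"server1": "U1", "server2": "U2", "server3": "U3"}

replica = 2
replica_sets = [bricks[i:i + replica] for i in range(0, len(bricks), replica)]

# Old AFR: each replica-set reports the UUID of its first brick's node.
node_uuid = [uuid_of[rs[0][0]] for rs in replica_sets]
print(node_uuid)  # ['U1', 'U2', 'U3'] -> every node appears once, so
                  # every geo-rep worker would consider itself Active.
```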
>
>             Does this mean that with this configuration all nodes
>             are active? Is this a problem? Is there any other check
>             to avoid this situation if it's not good?
>
>         Yes, all Geo-rep workers will become Active and participate
>         in syncing. Since the changelogs will have the same
>         information in replica bricks, this will lead to duplicate
>         syncing and consume network bandwidth.
>
>         Node-uuid based Active worker is the default
>
>
> --
> Pranith



More information about the Gluster-devel mailing list