[Gluster-devel] geo-rep regression because of node-uuid change
Xavier Hernandez
xhernandez at datalab.es
Fri Jul 7 09:35:15 UTC 2017
On 07/07/17 11:25, Pranith Kumar Karampuri wrote:
>
>
> On Fri, Jul 7, 2017 at 2:46 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
> On 07/07/17 10:12, Pranith Kumar Karampuri wrote:
>
>
>
>     On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
> Hi Pranith,
>
> On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
>
>
>
>         On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
> Hi Pranith,
>
> On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>
>             Xavi,
>             Now that the change has been reverted, we can resume this
>             discussion and decide on the exact format that considers
>             tier, dht, afr and ec. People working on geo-rep/dht/afr/ec
>             had an internal discussion and we all agreed that this
>             proposal would be a good way forward. I think once we agree
>             on the format and decide on the initial encoding/decoding
>             functions of the xattr and this change is merged, we can
>             send patches on afr/ec/dht and geo-rep to take it to
>             closure.
>
>             Could you propose the new format you have in mind that
>             considers all of the xlators?
>
>
>           My idea was to create a new xattr not bound to any particular
>           function but which could give enough information to be used
>           in many places.
>
>           Currently we have another attribute called
>           glusterfs.pathinfo that returns hierarchical information
>           about the location of a file. Maybe we can extend this to
>           unify all these attributes into a single feature that could
>           be used for multiple purposes.
>
>           Since we have time to discuss it, I would like to design it
>           with more information than we have already talked about.
>
>           First of all, the amount of information that this attribute
>           can contain is quite big if we expect to have volumes with
>           thousands of bricks. Even in the simplest case of returning
>           only a UUID, we can easily go beyond the limit of 64KB.
>
>           Consider also, for example, what shard should return when
>           pathinfo is requested for a file. Probably it should return
>           a list of shards, each one with all its associated pathinfo.
>           We are talking about big amounts of data here.
>
>           I think this kind of information doesn't fit very well in an
>           extended attribute. Another thing to consider is that most
>           probably the requester of the data only needs a fragment of
>           it, so we are generating big amounts of data only to be
>           parsed and reduced later, discarding most of it.
>
>           What do you think about using a very special virtual file to
>           manage all this information ? It could be easily read using
>           normal read fops, so it could manage big amounts of data
>           easily. Also, by accessing only some parts of the file we
>           could go directly where we want, avoiding the read of all
>           remaining data.
>
>           A very basic idea could be this:
>
>           Each xlator would have a reserved area of the file. We can
>           reserve up to 4GB per xlator (32 bits). The remaining 32
>           bits of the offset would indicate the xlator we want to
>           access.
>
>           At offset 0 we have generic information about the volume.
>           One of the things that this information should include is a
>           basic hierarchy of the whole volume and the offset for each
>           xlator.
>
>           After reading this, the user will seek to the desired offset
>           and read the information related to the xlator it is
>           interested in.
>
>           All the information should be stored in an easily extensible
>           format that will be kept compatible even if new information
>           is added in the future (for example doing special mappings
>           of the 32-bit offsets reserved for the xlator).
>
>           For example, we can reserve the first megabyte of the xlator
>           area to have a mapping of attributes with their respective
>           offsets.
>
>           I think that using a binary format would simplify all this a
>           lot.
>
>           Do you think this is a way worth exploring, or should I stop
>           wasting time here ?
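
A rough sketch of the offset split described above (purely illustrative;
the names and constants are hypothetical, nothing like this exists in
Gluster today): the high 32 bits of the 64-bit offset select the xlator,
the low 32 bits address data inside that xlator's 4GB area.

    XLATOR_SHIFT = 32
    LOCAL_MASK = (1 << XLATOR_SHIFT) - 1

    def make_offset(xlator_index, local_offset):
        """Compose the file offset used to read one xlator's area."""
        assert 0 <= local_offset <= LOCAL_MASK
        return (xlator_index << XLATOR_SHIFT) | local_offset

    def split_offset(offset):
        """Recover (xlator_index, local_offset) from a file offset."""
        return offset >> XLATOR_SHIFT, offset & LOCAL_MASK

Under this scheme, the generic volume information at offset 0 is
make_offset(0, 0), and the start of the third xlator's area is
make_offset(2, 0).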
>
>
>         I think this just became a very big feature :-). Shall we just
>         live with it the way it is now?
>
>
>       I suspected as much...
>
>       The only thing we need to check is if shard needs to handle this
>       xattr. If so, what should it return ? Only the UUIDs
>       corresponding to the first shard, or the UUIDs of all bricks
>       containing at least one shard ? I guess that the first one is
>       enough, but just to be sure...
>
>       My proposal was to implement a new xattr, for example
>       glusterfs.layout, that contains enough information to be usable
>       in all current use cases.
>
>
>     Actually pathinfo is supposed to give this information and it
>     already has the following format, for a 5x2 distributed-replicate
>     volume:
>
>
> Yes, I know. I wanted to unify all information.
>
>
> root at dhcp35-190 - /mnt/v3
> 13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
> # file: d
> trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
> (<REPLICATE:v3-replicate-0>
> <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
> <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
> (<REPLICATE:v3-replicate-2>
> <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
> <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
> (<REPLICATE:v3-replicate-1>
> <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
> <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
> (<REPLICATE:v3-replicate-4>
> <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
> <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
> (<REPLICATE:v3-replicate-3>
> <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
> <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
> (v3-dht-layout (v3-replicate-0 0 858993458)
>                (v3-replicate-1 858993459 1717986917)
>                (v3-replicate-2 1717986918 2576980376)
>                (v3-replicate-3 2576980377 3435973835)
>                (v3-replicate-4 3435973836 4294967295)))"
>
>
> root at dhcp35-190 - /mnt/v3
> 13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
> # file: d/a
> trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
> (<REPLICATE:v3-replicate-1>
> <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
> <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
>
>
>
>
>       The idea would be that each xlator that makes a significant
>       change in the way or the place where files are stored should put
>       information in this xattr. The information should include:
>
>       * Type (basically AFR, EC, DHT, ...)
>       * Basic configuration (replication and arbiter for AFR, data and
>         redundancy for EC, # subvolumes for DHT, shard size for
>         sharding, ...)
>       * Quorum imposed by the xlator
>       * UUID data coming from subvolumes (sorted by brick position)
>       * It should be easily extensible in the future
>
>       The last point is very important to avoid the issues we have
>       seen now. We must be able to incorporate more information
>       without breaking backward compatibility. To do so, we can add
>       tags for each value.
>
>       For example, a distribute 2, replica 2 volume with 1 arbiter
>       should be represented by this string:
>
>       DHT[dist=2,quorum=1](
>           AFR[rep=2,arbiter=1,quorum=2](
>               NODE[quorum=2,uuid=<UUID1>](<path1>),
>               NODE[quorum=2,uuid=<UUID2>](<path2>),
>               NODE[quorum=2,uuid=<UUID3>](<path3>)
>           ),
>           AFR[rep=2,arbiter=1,quorum=2](
>               NODE[quorum=2,uuid=<UUID4>](<path4>),
>               NODE[quorum=2,uuid=<UUID5>](<path5>),
>               NODE[quorum=2,uuid=<UUID6>](<path6>)
>           )
>       )
>
>       Some explanations:
>
>       AFAIK DHT doesn't have quorum, so the default is '1'. We may
>       decide to omit it when it's '1' for any xlator.
>
>       Quorum in AFR represents the client-side enforced quorum. Quorum
>       in NODE represents the server-side enforced quorum.
>
>       The <path> shown in each NODE represents the physical location
>       of the file (similar to the current glusterfs.pathinfo) because
>       this xattr can be retrieved for a particular file using
>       getxattr. This is nice, but we can remove it for now if it's
>       difficult to implement.
>
>       We can decide to have a verbose string or try to omit some
>       fields when not strictly necessary. For example, if there are no
>       arbiters, we can omit the 'arbiter' tag instead of writing
>       'arbiter=0'. We could also implicitly compute 'dist' and 'rep'
>       from the number of elements contained between '()'.
>
> What do you think ?
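
As an illustration only (hypothetical helper names; nothing here is an
existing Gluster API), the proposed string for the example above could be
built like this:

    def node(uuid, path, quorum=2):
        return "NODE[quorum=%d,uuid=%s](%s)" % (quorum, uuid, path)

    def afr(subvols, rep=2, arbiter=1, quorum=2):
        return "AFR[rep=%d,arbiter=%d,quorum=%d](%s)" % (
            rep, arbiter, quorum, ",".join(subvols))

    def dht(subvols, dist=2, quorum=1):
        return "DHT[dist=%d,quorum=%d](%s)" % (
            dist, quorum, ",".join(subvols))

    layout = dht([
        afr([node("<UUID%d>" % i, "<path%d>" % i) for i in (1, 2, 3)]),
        afr([node("<UUID%d>" % i, "<path%d>" % i) for i in (4, 5, 6)]),
    ])

Because every value is tagged, a later version could append new tags
without breaking a consumer that only looks for the tags it knows.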
>
>
>     Quite a few people are already familiar with pathinfo. So I am of
>     the opinion that we give this information for that xattr itself.
>     This xattr hasn't changed after quorum/arbiter/shard came in, so
>     maybe they should?
>
>
>   Not sure how easy it would be to change the format of pathinfo to
>   incorporate the new information without breaking existing features
>   or even user scripts based on it. Maybe a new xattr would be easier
>   to implement and adapt.
>
>
> Probably.
>
>
>
>   I missed one important thing in the format: an xlator may have
>   per-subvolume information. This information can be placed just
>   before each subvolume's information:
>
>   DHT[dist=2,quorum=1](
>       [hash-range=0x00000000-0x7fffffff]AFR[...](...),
>       [hash-range=0x80000000-0xffffffff]AFR[...](...)
>   )
>
>
> Yes, makes sense.
>
> In general I am better at solving problems someone faces, because things
> will be more concrete. Do you think it is better to wait until the first
> consumer of this functionality comes along and gives their inputs about
> what would be nice to have vs must have? At the moment I am not sure how
> to distinguish what must be there vs what is nice to have :-(.
The good thing is that using this format we can easily start with bare
minimum information, like this:
DHT(
AFR(
NODE[uuid=<UUID1>],
NODE[uuid=<UUID2>],
NODE[uuid=<UUID3>]
),
AFR(
NODE[uuid=<UUID1>],
NODE[uuid=<UUID2>],
NODE[uuid=<UUID3>]
)
)
And add more information as it is needed, since it won't break backward
compatibility.
Xavi
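
A minimal sketch of a parser for this format (assuming the grammar
NAME[key=value,...](child, child, ...), with the tag list and the child
list both optional; this is illustrative, not an implementation proposal):

    import re

    HEAD = re.compile(r'\s*([A-Z]+)(?:\[([^\]]*)\])?')

    def parse(s, pos=0):
        """Parse one element; return (tree, next position)."""
        m = HEAD.match(s, pos)
        tags = dict(t.split('=', 1)
                    for t in (m.group(2) or '').split(',') if t)
        elem = {'name': m.group(1), 'tags': tags, 'children': []}
        pos = m.end()
        if pos < len(s) and s[pos] == '(':
            pos += 1
            while True:
                child, pos = parse(s, pos)
                elem['children'].append(child)
                while s[pos].isspace():
                    pos += 1
                if s[pos] == ',':
                    pos += 1
                else:
                    break
            pos += 1  # consume ')'
        return elem, pos

Starting from the bare minimum above, parse() keeps working as tags like
quorum or hash-range are added later, which is the backward-compatibility
argument for this format.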
>
>
> Xavi
>
>
>
>
> Xavi
>
>
>
>
> Xavi
>
>
>
>
>             On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya <ksubrahm at redhat.com> wrote:
>
>
>
>               On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
>                 That's ok. I'm currently unable to write a patch for
>                 this on ec.
>
> Sunil is working on this patch.
>
> ~Karthik
>
>                 If no one can do it, I can try to do it in 6 - 7
>                 hours...
>
> Xavi
>
>
>                 On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>
>
>
>                   On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
>                     I'm ok with reverting node-uuid content to the
>                     previous format and creating a new xattr for the
>                     new format. Currently, only rebalance will use it.
>
>                     The only thing to consider is what can happen if
>                     we have a half upgraded cluster where some clients
>                     have this change and some not. Can rebalance work
>                     in this situation ? If so, could there be any
>                     issue ?
>
>
>                   I think there shouldn't be any problem, because this
>                   is an in-memory xattr, so layers below afr/ec will
>                   only see the node-uuid xattr. This also gives us a
>                   chance to do whatever we want to do in future with
>                   this xattr without any problems about backward
>                   compatibility.
>
>                   You can check
>                   https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>                   for how Karthik implemented this in AFR (this got
>                   merged accidentally yesterday, but looks like this
>                   is what we are settling on).
>
>
>
> Xavi
>
>
>                     On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>
>
>
>                       On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran <nbalacha at redhat.com> wrote:
>
>
>                         On 20 June 2017 at 20:38, Aravinda <avishwan at redhat.com> wrote:
>
>                           On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>
>                             Xavi, Aravinda and I had a discussion on
>                             #gluster-dev and we agreed to go with the
>                             format Aravinda suggested for now; in
>                             future we want some more changes for dht
>                             to detect which subvolume went down and
>                             came back up, and at that time we will
>                             revisit the solution suggested by Xavi.
>
>                             Susanth is doing the dht changes.
>                             Aravinda is doing the geo-rep changes.
>
>                           Done. Geo-rep patch sent for review:
>                           https://review.gluster.org/17582
>
>
>
>                         The proposed changes to the node-uuid
>                         behaviour (while good) are going to break
>                         tiering. Tiering changes will take a little
>                         more time to be coded and tested.
>
>                         As this is a regression for 3.11 and a blocker
>                         for 3.11.1, I suggest we go back to the
>                         original node-uuid behaviour for now so as to
>                         unblock the release, and target the proposed
>                         changes for the next 3.11 releases.
>
>
>                       Let me see if I understand the changes
>                       correctly. We are restoring the behavior of the
>                       node-uuid xattr and adding a new xattr for
>                       parallel rebalance for both afr and ec, correct?
>                       Otherwise that is one more regression. If yes,
>                       we will also wait for Xavi's inputs. Jeff
>                       accidentally merged the afr patch yesterday
>                       which does these changes. If everyone is in
>                       agreement, we will leave it as is and add
>                       similar changes in ec as well. If we are not in
>                       agreement, then we will let the discussion
>                       progress :-)
>
>
>
>
> Regards,
> Nithya
>
> --
> Aravinda
>
>
>                             Thanks to all of you guys for the discussions!
>
>                             On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
> Hi Aravinda,
>
>                               On 20/06/17 12:42, Aravinda wrote:
>
>                                 I think the following format can be
>                                 easily adopted by all components.
>
>                                 UUIDs of a subvolume are separated by
>                                 space and subvolumes are separated by
>                                 comma.
>
>                                 For example, node1 and node2 are a
>                                 replica with UUIDs U1 and U2
>                                 respectively, and node3 and node4 are
>                                 a replica with UUIDs U3 and U4
>                                 respectively:
>
>                                 node-uuid can return "U1 U2,U3 U4"
>
>
>                               While this is ok for the current
>                               implementation, I think this can be
>                               insufficient if there are more layers of
>                               xlators that need to indicate some sort
>                               of grouping. Some representation that
>                               can express hierarchy would be better.
>                               For example: "(U1 U2) (U3 U4)" (we can
>                               use spaces or comma as a separator).
>
>
>
>                                 Geo-rep can split by "," and then
>                                 split by space and take the first
>                                 UUID. DHT can split the value by space
>                                 or comma and get the list of unique
>                                 UUIDs.
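
In other words (a hedged sketch of the two consumers described above, not
code from either component):

    value = "U1 U2,U3 U4"
    # geo-rep: first UUID of each replica set is the Active candidate
    georep_active = [group.split()[0] for group in value.split(",")]
    # dht: just needs the set of unique node UUIDs
    dht_nodes = sorted(set(value.replace(",", " ").split()))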
>
>
>                               This doesn't solve the problem I
>                               described in the previous email. Some
>                               more logic will need to be added to
>                               avoid more than one node from each
>                               replica-set being active. If we have
>                               some explicit hierarchy information in
>                               the node-uuid value, more decisions can
>                               be taken.
>
>                               An initial proposal I made was this:
>
>                               DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
>                                      AFR[2,0](NODE(U1), NODE(U2)))
>
>                               This is harder to parse, but gives a lot
>                               of information: DHT with 2 subvolumes,
>                               each subvolume an AFR with replica 2 and
>                               no arbiters. It's also easily extensible
>                               with any new xlator that changes the
>                               layout.
>
>                               However maybe this is not the moment to
>                               do this, and probably we could implement
>                               this in a new xattr with a better name.
>
> Xavi
>
>
>
>                                 Another question is about the behavior
>                                 when a node is down: the existing
>                                 node-uuid xattr will not return that
>                                 UUID if a node is down. What is the
>                                 behavior with the proposed xattr ?
>
>                                 Let me know your thoughts.
>
> regards
> Aravinda VK
>
>                                 On 06/20/2017 03:06 PM, Aravinda wrote:
>
> Hi Xavi,
>
>                                 On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>
>                                 Hi Aravinda,
>
>
>                                 On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>
>
>                                 Adding more people to get a consensus about this.
>
>
>                                 On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan at redhat.com> wrote:
>
>
> regards
>
> Aravinda VK
>
>
>
>                                 On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
>
>                                     Hi Pranith,
>
>
>                                     adding gluster-devel, Kotresh and Aravinda,
>
>
>                                     On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>
>
>
>
>                                     On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
>
>                                     On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>
>
>                                     The way geo-replication works is:
>                                     on each machine, it does a
>                                     getxattr of node-uuid and checks
>                                     if its own uuid is present in the
>                                     list. If it is present then it
>                                     will consider itself active,
>                                     otherwise it will be considered
>                                     passive. With this change we are
>                                     giving all uuids instead of the
>                                     first-up subvolume. So all
>                                     machines think they are ACTIVE,
>                                     which is bad apparently. So that
>                                     is the reason. Even I felt bad
>                                     that we are doing this change.
>
>
>
>                                     And what about changing the
>                                     content of node-uuid to include
>                                     some sort of hierarchy ? For
>                                     example:
>
>                                     a single brick:
>
>                                     NODE(<guid>)
>
>                                     AFR/EC:
>
>                                     AFR[2](NODE(<guid>), NODE(<guid>))
>                                     EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>
>                                     DHT:
>
>                                     DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>                                            AFR[2](NODE(<guid>), NODE(<guid>)))
>
>                                     This gives a lot of information
>                                     that can be used to take the
>                                     appropriate decisions.
>
>
>
>                                     I guess that is not backward
>                                     compatible. Shall I CC
>                                     gluster-devel and
>                                     Kotresh/Aravinda?
>
>
>
>                                     Is the change we did backward
>                                     compatible ? If we only require
>                                     the first field to be a GUID to
>                                     support backward compatibility, we
>                                     can use something like this:
>
>
>                                 No. But the necessary change can be
>                                 made to the geo-rep code as well if
>                                 the format is changed, since all these
>                                 are built/shipped together.
>
>
>                                 Geo-rep uses node-uuid as follows:
>
>                                     list = getxattr(node-uuid)
>                                     active_node_uuids = list.split(SPACE)
>                                     active_node_flag = True if self.node_id
>                                         in active_node_uuids else False
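
A runnable equivalent of that pseudocode might look like this (a sketch
only; the xattr is read from a mount point, and the exact xattr name as
seen from user space is an assumption here):

    import os

    def is_active(mount_path, my_uuid):
        # Virtual xattr served by the cluster xlators (afr/ec/dht).
        value = os.getxattr(mount_path, "trusted.glusterfs.node-uuid")
        active_node_uuids = value.decode().split(" ")
        return my_uuid in active_node_uuids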
>
>
>
>                                     How was this case solved ?
>
>
>                                     Suppose we have three servers and
>                                     2 bricks in each server. A
>                                     replicated volume is created using
>                                     the following command:
>
>
>                                     gluster volume create test replica 2
>                                         server1:/brick1 server2:/brick1
>                                         server2:/brick2 server3:/brick1
>                                         server3:/brick2 server1:/brick2
>
>
>                                     In this case we have three
>                                     replica-sets:
>
>                                     * server1:/brick1 server2:/brick1
>                                     * server2:/brick2 server3:/brick1
>                                     * server3:/brick2 server1:/brick2
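
The pairing rule itself is simple: bricks form replica sets in the order
they are listed. A sketch of that grouping (illustrative only):

    bricks = ["server1:/brick1", "server2:/brick1", "server2:/brick2",
              "server3:/brick1", "server3:/brick2", "server1:/brick2"]
    replica = 2
    replica_sets = [bricks[i:i + replica]
                    for i in range(0, len(bricks), replica)]
    # [['server1:/brick1', 'server2:/brick1'],
    #  ['server2:/brick2', 'server3:/brick1'],
    #  ['server3:/brick2', 'server1:/brick2']]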
>
>
>                                     The old AFR implementation of
>                                     node-uuid always returned the uuid
>                                     of the node of the first brick, so
>                                     in this case we will get the uuids
>                                     of all three nodes because each of
>                                     them is the first brick of some
>                                     replica-set.
>
>                                     Does this mean that with this
>                                     configuration all nodes are
>                                     active ? Is this a problem ? Is
>                                     there any other check to avoid
>                                     this situation if it's not good ?
>
>                                 Yes, all geo-rep workers will become
>                                 Active and participate in syncing.
>                                 Since changelogs will have the same
>                                 information in replica bricks, this
>                                 will lead to duplicate syncing and
>                                 consume network bandwidth.
>
>
>                                 Node-uuid based Active worker is the default
>
>
>
>
> --
> Pranith