[Gluster-devel] geo-rep regression because of node-uuid change
Xavier Hernandez
xhernandez at datalab.es
Fri Jul 7 09:16:37 UTC 2017
On 07/07/17 10:12, Pranith Kumar Karampuri wrote:
>
>
> On Fri, Jul 7, 2017 at 1:13 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> Hi Pranith,
>
> On 05/07/17 12:28, Pranith Kumar Karampuri wrote:
>
>
>
> On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> Hi Pranith,
>
> On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>
> Xavi,
> Now that the change has been reverted, we can resume this discussion
> and decide on the exact format that considers tier, dht, afr and ec.
> The people working on geo-rep/dht/afr/ec had an internal discussion
> and we all agreed that this proposal would be a good way forward. I
> think once we agree on the format, decide on the initial
> encoding/decoding functions for the xattr, and merge that change, we
> can send patches for afr/ec/dht and geo-rep to take it to closure.
>
> Could you propose the new format you have in mind that considers all
> of the xlators?
>
>
> My idea was to create a new xattr not bound to any particular
> function, but one that could give enough information to be used in
> many places.
>
> Currently we have another attribute, glusterfs.pathinfo, that returns
> hierarchical information about the location of a file. Maybe we can
> extend this to unify all these attributes into a single feature that
> could be used for multiple purposes.
>
> Since we have time to discuss it, I would like to design it with more
> information than we have already talked about.
>
> First of all, the amount of information that this attribute can
> contain is quite big if we expect to have volumes with thousands of
> bricks. Even in the simplest case of returning only a UUID, we can
> easily go beyond the limit of 64KB.
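A quick back-of-the-envelope check of this limit (a sketch only; it assumes the canonical 36-character textual UUID plus a one-byte separator per brick, which is not taken from the mail above):

```python
# Rough estimate of the xattr value size if it listed one UUID per brick.
# Assumes 36-character textual UUIDs plus a one-byte separator (assumption).
UUID_TEXT_LEN = 36
SEPARATOR_LEN = 1

def xattr_value_size(bricks):
    """Bytes needed to list one UUID per brick, space-separated."""
    return bricks * (UUID_TEXT_LEN + SEPARATOR_LEN)

print(xattr_value_size(1000))  # 37000 bytes, still fits in 64 KiB
print(xattr_value_size(2000))  # 74000 bytes, already over the 65536 limit
```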
>
> Consider also, for example, what shard should return when pathinfo is
> requested for a file. Probably it should return a list of shards,
> each one with all its associated pathinfo. We are talking about big
> amounts of data here.
>
> I think this kind of information doesn't fit very well in an extended
> attribute. Another thing to consider is that the requester most
> probably needs only a fragment of the data, so we would be generating
> big amounts of data only to have it parsed and reduced later,
> dismissing most of it.
>
> What do you think about using a very special virtual file to manage
> all this information? It could be read using normal read fops, so it
> could handle big amounts of data easily. Also, by accessing only some
> parts of the file we could go directly where we want, avoiding the
> read of all remaining data.
>
> A very basic idea could be this:
>
> Each xlator would have a reserved area of the file. We can reserve up
> to 4GB per xlator (32 bits). The remaining 32 bits of the offset
> would indicate the xlator we want to access.
>
> At offset 0 we have generic information about the volume. One of the
> things this information should include is a basic hierarchy of the
> whole volume and the offset for each xlator.
>
> After reading this, the user will seek to the desired offset and read
> the information related to the xlator it is interested in.
>
> All the information should be stored in an easily extensible format
> that will be kept compatible even if new information is added in the
> future (for example by doing special mappings of the 32-bit offsets
> reserved for each xlator).
>
> For example, we can reserve the first megabyte of the xlator area to
> hold a mapping of attributes with their respective offsets.
>
> I think that using a binary format would simplify all this a lot.
>
> Do you think this is a way worth exploring, or should I stop wasting
> time here?
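The offset split described above can be sketched in a few lines (this is only an illustration of the proposal, not existing GlusterFS code; the xlator index values are hypothetical):

```python
# Sketch of the 64-bit offset split from the proposal above: the upper
# 32 bits select the xlator area, the lower 32 bits address up to 4 GiB
# inside that area. Nothing here is an existing API.

XLATOR_SHIFT = 32
AREA_MASK = 0xFFFFFFFF  # 4 GiB per xlator

def make_offset(xlator_index, local_offset):
    """Combine an xlator index and an offset inside its area."""
    assert 0 <= local_offset <= AREA_MASK
    return (xlator_index << XLATOR_SHIFT) | local_offset

def split_offset(offset):
    """Recover (xlator_index, local_offset) from a file offset."""
    return offset >> XLATOR_SHIFT, offset & AREA_MASK

# Seek to byte 1 MiB inside the area of (hypothetical) xlator #3:
off = make_offset(3, 1 << 20)
assert split_offset(off) == (3, 1 << 20)
```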
>
>
> I think this just became a very big feature :-). Shall we just live
> with it the way it is now?
>
>
> I suspected as much...
>
> The only thing we need to check is whether shard needs to handle this
> xattr. If so, what should it return? Only the UUIDs corresponding to
> the first shard, or the UUIDs of all bricks containing at least one
> shard? I guess the first one is enough, but just to be sure...
>
> My proposal was to implement a new xattr, for example
> glusterfs.layout, that contains enough information to be usable in
> all current use cases.
>
>
> Actually pathinfo is supposed to give this information, and it
> already has the following format for a 5x2 distributed-replicate
> volume:
Yes, I know. I wanted to unify all information.
>
> root at dhcp35-190 - /mnt/v3
> 13:38:12 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d
> # file: d
> trusted.glusterfs.pathinfo="((<DISTRIBUTE:v3-dht>
> (<REPLICATE:v3-replicate-0>
> <POSIX(/home/gfs/v3_0):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_0/d>
> <POSIX(/home/gfs/v3_1):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_1/d>)
> (<REPLICATE:v3-replicate-2>
> <POSIX(/home/gfs/v3_5):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_5/d>
> <POSIX(/home/gfs/v3_4):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_4/d>)
> (<REPLICATE:v3-replicate-1>
> <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d>
> <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d>)
> (<REPLICATE:v3-replicate-4>
> <POSIX(/home/gfs/v3_8):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_8/d>
> <POSIX(/home/gfs/v3_9):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_9/d>)
> (<REPLICATE:v3-replicate-3>
> <POSIX(/home/gfs/v3_6):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_6/d>
> <POSIX(/home/gfs/v3_7):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_7/d>))
> (v3-dht-layout (v3-replicate-0 0 858993458) (v3-replicate-1 858993459
> 1717986917) (v3-replicate-2 1717986918 2576980376) (v3-replicate-3
> 2576980377 3435973835) (v3-replicate-4 3435973836 4294967295)))"
>
>
> root at dhcp35-190 - /mnt/v3
> 13:38:26 :) ⚡ getfattr -n trusted.glusterfs.pathinfo d/a
> # file: d/a
> trusted.glusterfs.pathinfo="(<DISTRIBUTE:v3-dht>
> (<REPLICATE:v3-replicate-1>
> <POSIX(/home/gfs/v3_3):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_3/d/a>
> <POSIX(/home/gfs/v3_2):dhcp35-190.lab.eng.blr.redhat.com:/home/gfs/v3_2/d/a>))"
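As an aside, scripts that consume this xattr today typically pull the brick entries out with a regular expression; a minimal sketch (the sample value is abbreviated, and "host1" is a placeholder hostname, not the real output above):

```python
import re

# Minimal sketch: extract host:path brick entries from a
# trusted.glusterfs.pathinfo value. The sample value is abbreviated from
# the output above; "host1" is a placeholder hostname.
pathinfo = ("(<DISTRIBUTE:v3-dht> (<REPLICATE:v3-replicate-1> "
            "<POSIX(/home/gfs/v3_3):host1:/home/gfs/v3_3/d/a> "
            "<POSIX(/home/gfs/v3_2):host1:/home/gfs/v3_2/d/a>))")

# Each brick looks like <POSIX(backend-path):host:brick-path>
bricks = re.findall(r'<POSIX\([^)]*\):([^:]+):([^>]+)>', pathinfo)
print(bricks)  # [('host1', '/home/gfs/v3_3/d/a'), ('host1', '/home/gfs/v3_2/d/a')]
```

Parsing this nested, unversioned text format with regexes is exactly the kind of fragility the new-xattr discussion below is trying to avoid.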
>
>
>
>
> The idea would be that each xlator that makes a significant change
> in the way or the place where files are stored, should put
> information in this xattr. The information should include:
>
> * Type (basically AFR, EC, DHT, ...)
> * Basic configuration (replication and arbiter for AFR, data and
> redundancy for EC, # subvolumes for DHT, shard size for sharding, ...)
> * Quorum imposed by the xlator
> * UUID data coming from subvolumes (sorted by brick position)
> * It should be easily extensible in the future
>
> The last point is very important to avoid the issues we have seen
> now. We must be able to incorporate more information without
> breaking backward compatibility. To do so, we can add tags for each
> value.
>
> For example, a distribute 2, replica 2 volume with 1 arbiter should
> be represented by this string:
>
> DHT[dist=2,quorum=1](
> AFR[rep=2,arbiter=1,quorum=2](
> NODE[quorum=2,uuid=<UUID1>](<path1>),
> NODE[quorum=2,uuid=<UUID2>](<path2>),
> NODE[quorum=2,uuid=<UUID3>](<path3>)
> ),
> AFR[rep=2,arbiter=1,quorum=2](
> NODE[quorum=2,uuid=<UUID4>](<path4>),
> NODE[quorum=2,uuid=<UUID5>](<path5>),
> NODE[quorum=2,uuid=<UUID6>](<path6>)
> )
> )
>
> Some explanations:
>
> AFAIK DHT doesn't have quorum, so the default is '1'. We may decide
> to omit it when it's '1' for any xlator.
>
> Quorum in AFR represents client-side enforced quorum. Quorum in NODE
> represents the server-side enforced quorum.
>
> The <path> shown in each NODE represents the physical location of
> the file (similar to current glusterfs.pathinfo) because this xattr
> can be retrieved for a particular file using getxattr. This is nice,
> but we can remove it for now if it's difficult to implement.
>
> We can decide to have a verbose string or try to omit some fields
> when not strictly necessary. For example, if there are no arbiters,
> we can omit the 'arbiter' tag instead of writing 'arbiter=0'. We
> could also implicitly compute 'dist' and 'rep' from the number of
> elements contained between '()'.
>
> What do you think ?
>
>
> Quite a few people are already familiar with path-info. So I am of
> the opinion that we provide this information through that xattr
> itself. This xattr hasn't changed after quorum/arbiter/shard came in,
> so maybe they should?
Not sure how easy it would be to change the format of path-info to
incorporate the new information without breaking existing features or
even user scripts based on it. Maybe a new xattr would be easier to
implement and adapt.

I missed one important thing in the format: an xlator may have
per-subvolume information. This information can be placed just before
each subvolume's information:
DHT[dist=2,quorum=1](
[hash-range=0x00000000-0x7fffffff]AFR[...](...),
[hash-range=0x80000000-0xffffffff]AFR[...](...)
)
Xavi
>
> Xavi
>
> On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
> <ksubrahm at redhat.com> wrote:
>
> On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> That's ok. I'm currently unable to write a patch for this on ec.
>
> Sunil is working on this patch.
>
> ~Karthik
>
> If no one can do it, I can try to do it in 6 - 7 hours...
>
> Xavi
>
> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>
> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> I'm ok with reverting node-uuid content to the previous format and
> creating a new xattr for the new format. Currently, only rebalance
> will use it.
>
> The only thing to consider is what can happen if we have a
> half-upgraded cluster where some clients have this change and some do
> not. Can rebalance work in this situation? If so, could there be any
> issue?
>
> I think there shouldn't be any problem, because this is an in-memory
> xattr, so layers below afr/ec will only see the node-uuid xattr. This
> also gives us a chance to do whatever we want to do in future with
> this xattr without any backward compatibility problems.
>
> You can check
> https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
> for how Karthik implemented this in AFR (this got merged accidentally
> yesterday, but it looks like this is what we are settling on).
>
> Xavi
>
>
> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>
> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran
> <nbalacha at redhat.com> wrote:
>
> On 20 June 2017 at 20:38, Aravinda <avishwan at redhat.com> wrote:
>
> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>
> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed
> to go with the format Aravinda suggested for now. In future we want
> some more changes for dht to detect which subvolume went down and
> came back up; at that time we will revisit the solution suggested by
> Xavi.
>
> Susanth is doing the dht changes.
> Aravinda is doing the geo-rep changes.
>
> Done. Geo-rep patch sent for review:
> https://review.gluster.org/17582
>
> The proposed changes to the node-uuid behaviour (while good) are
> going to break tiering. Tiering changes will take a little more time
> to be coded and tested.
>
> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest
> we go back to the original node-uuid behaviour for now so as to
> unblock the release, and target the proposed changes for the next
> 3.11 releases.
>
> Let me see if I understand the changes correctly. We are restoring
> the behavior of the node-uuid xattr and adding a new xattr for
> parallel rebalance for both afr and ec, correct? Otherwise that is
> one more regression. If yes, we will also wait for Xavi's inputs.
> Jeff accidentally merged the afr patch yesterday which does these
> changes. If everyone is in agreement, we will leave it as is and add
> similar changes in ec as well. If we are not in agreement, then we
> will let the discussion progress :-)
>
> Regards,
> Nithya
>
> --
> Aravinda
>
>
> Thanks to all of you guys for the discussions!
>
> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> Hi Aravinda,
>
> On 20/06/17 12:42, Aravinda wrote:
>
> I think the following format can be easily adopted by all components:
>
> UUIDs of a subvolume are separated by space and subvolumes are
> separated by comma.
>
> For example, node1 and node2 are a replica with UUIDs U1 and U2
> respectively, and node3 and node4 are a replica with UUIDs U3 and U4
> respectively:
>
> node-uuid can return "U1 U2,U3 U4"
>
> While this is ok for the current implementation, I think it can be
> insufficient if there are more layers of xlators that need to
> indicate some sort of grouping. Some representation that can express
> hierarchy would be better. For example: "(U1 U2) (U3 U4)" (we can use
> spaces or commas as separators).
>
> Geo-rep can split by "," and then split by space and take the first
> UUID. DHT can split the value by space or comma and get the list of
> unique UUIDs.
>
> This doesn't solve the problem I described in the previous email.
> Some more logic will need to be added to avoid more than one node
> from each replica-set becoming active. If we have some explicit
> hierarchy information in the node-uuid value, more decisions can be
> taken.
>
> An initial proposal I made was this:
>
> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2)))
>
> This is harder to parse, but gives a lot of information: DHT with 2
> subvolumes, each subvolume an AFR with replica 2 and no arbiters.
> It's also easily extensible with any new xlator that changes the
> layout.
>
> However, maybe this is not the moment to do this, and probably we
> could implement it in a new xattr with a better name.
>
> Xavi
>
> Another question is about the behavior when a node is down: the
> existing node-uuid xattr will not return that UUID if a node is down.
> What is the behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>
> Hi Xavi,
>
> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>
> Hi Aravinda,
>
> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>
> Adding more people to get a consensus about this.
>
> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avishwan at redhat.com>
> wrote:
>
> regards
> Aravinda VK
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
> Hi Pranith,
>
> adding gluster-devel, Kotresh and Aravinda,
>
> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>
> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>
> The way geo-replication works is: on each machine, it does a getxattr
> of node-uuid and checks if its own uuid is present in the list. If it
> is present, it will consider itself active; otherwise it will be
> considered passive. With this change we are giving all uuids instead
> of the first-up subvolume's, so all machines think they are ACTIVE,
> which is bad apparently. So that is the reason. Even I felt bad that
> we are doing this change.
>
>
>
> And what about changing the content of node-uuid to include some sort
> of hierarchy? For example:
>
> a single brick:
>
> NODE(<guid>)
>
> AFR/EC:
>
> AFR[2](NODE(<guid>), NODE(<guid>))
> EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>
> DHT:
>
> DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
> AFR[2](NODE(<guid>), NODE(<guid>)))
>
> This gives a lot of information that can be used to take the
> appropriate decisions.
>
> I guess that is not backward compatible. Shall I CC gluster-devel and
> Kotresh/Aravinda?
>
> Is the change we did backward compatible? If we only require the
> first field to be a GUID to support backward compatibility, we can
> use something like this:
>
>
> No. But the necessary change can be made to the Geo-rep code as well
> if the format is changed, since all these are built/shipped together.
>
> Geo-rep uses node-uuid as follows:
>
> list = listxattr(node-uuid)
> active_node_uuids = list.split(SPACE)
> active_node_flag = True if self.node_id exists in active_node_uuids
> else False
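The pseudocode above can be sketched as plain Python (the uuid values are placeholders, and this is an illustration of the described logic, not the actual gsyncd code):

```python
# Sketch of the geo-rep active/passive decision Aravinda describes.
# The xattr value and uuids below are placeholders, not real output.

def is_active(node_uuid_value, own_uuid):
    """A worker is Active iff its own node UUID appears in the
    space-separated node-uuid xattr value."""
    return own_uuid in node_uuid_value.split()

# Old behaviour: only the first-up brick's node UUID per subvolume is
# listed, so only one worker per replica set becomes Active.
assert is_active("U1 U3", "U1") is True
assert is_active("U1 U3", "U2") is False
```

This also shows why returning *all* uuids broke geo-rep: with every node listed, `is_active` returns True on every worker, so all of them sync in parallel.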
>
>
> How was this case solved? Suppose we have three servers and 2 bricks
> in each server. A replicated volume is created using the following
> command:
>
> gluster volume create test replica 2 server1:/brick1 server2:/brick1
> server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2
>
> In this case we have three replica-sets:
>
> * server1:/brick1 server2:/brick1
> * server2:/brick2 server3:/brick1
> * server3:/brick2 server1:/brick2
>
> The old AFR implementation for node-uuid always returned the uuid of
> the node of the first brick, so in this case we will get the uuids of
> all three nodes because each of them holds the first brick of some
> replica-set.
>
> Does this mean that with this configuration all nodes are active? Is
> this a problem? Is there any other check to avoid this situation if
> it's not good?
>
> Yes, all Geo-rep workers will become Active and participate in
> syncing. Since changelogs will have the same information in replica
> bricks, this will lead to duplicate syncing and wasted network
> bandwidth.
>
> Node-uuid based Active worker selection is the default configuration
> in Geo-rep till now. Geo-rep also has Meta Volume based
> synchronization for Active workers using lock files (it can be opted
> into via Geo-rep configuration; with this config node-uuid will not
> be used).
>
> Kotresh proposed a solution to configure which worker becomes
> Active. This will give the admin more control to choose Active
> workers, and will become the default configuration from 3.12:
> https://github.com/gluster/glusterfs/issues/244
>
> --
> Aravinda
>
>
>
> Xavi
>
> Bricks:
>
> <guid>
>
> AFR/EC:
>
> <guid>(<guid>, <guid>)
>
> DHT:
>
> <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>
> In this case, AFR and EC would return the same <guid> they returned
> before the patch, but between '(' and ')' they put the full list of
> guids of all nodes. The first <guid> can be used by geo-replication.
> The list after the first <guid> can be used for rebalance.
>
> Not sure if there's any user of node-uuid above DHT.
>
> Xavi
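A sketch of how the two consumers could read this backward-compatible value (the format is only a proposal from this thread; the helper names and the sample value are illustrative):

```python
import re

# Sketch for the proposed backward-compatible value "<guid>(<guid>, <guid>)":
# the leading guid keeps old consumers (geo-rep) working, while the
# parenthesised list serves new consumers (parallel rebalance).
# Helper names and the sample value are illustrative, not an existing API.

def first_guid(value):
    """What an old geo-rep-style consumer would keep using."""
    return value.split('(', 1)[0]

def all_guids(value):
    """What rebalance would use: every guid inside the parentheses."""
    inner = re.search(r'\((.*)\)', value)
    return [g.strip() for g in inner.group(1).split(',')] if inner else []

value = "U1(U1, U2)"
assert first_guid(value) == "U1"
assert all_guids(value) == ["U1", "U2"]
```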
>
>
>
>
>
> Xavi
>
> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> Hi Pranith,
>
> On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>
> hi Xavi,
> We all made the mistake of not sending a mail about changing the
> behavior of the node-uuid xattr so that rebalance can use multiple
> nodes for doing rebalance. Because of this, on geo-rep all the
> workers are becoming active instead of one per EC/AFR subvolume. So
> we are frantically trying to restore the functionality of node-uuid
> and introduce a new xattr for
>
> --
> Pranith