[Gluster-devel] Feature review: Improved rebalance performance
Raghavendra Gowdappa
rgowdapp at redhat.com
Tue Jul 1 09:01:23 UTC 2014
----- Original Message -----
> From: "Shyamsundar Ranganathan" <srangana at redhat.com>
> To: "Xavier Hernandez" <xhernandez at datalab.es>
> Cc: gluster-devel at gluster.org
> Sent: Tuesday, July 1, 2014 1:48:09 AM
> Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance
>
> > From: "Xavier Hernandez" <xhernandez at datalab.es>
> >
> > Hi Shyam,
> >
> > On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > > It also touches upon a rebalance-on-access-like mechanism where we
> > > could potentially move data out of existing bricks to a newer brick
> > > faster in the case of brick addition (and vice versa for brick
> > > removal), and heal the rest of the data on access.
> > >
> > Will this "rebalance on access" feature be enabled always, or only during
> > a brick addition/removal to move files that do not go to the affected
> > brick while the main rebalance is populating or removing files from it?
>
> The rebalance on access, as I see it, stands as follows (in a little more
> detail than what is on the feature page):
>
> Step 1: Initiation of the process
> - Admin chooses to "rebalance _changed_" bricks
> - This could mean added/removed/changed-size bricks
> [3]- Rebalance on access is triggered, so as to move files asynchronously
> when they are accessed (a sketch of this check follows the steps below)
> [1]- Background rebalance acts only to (re)move data to/from these bricks
> [2]- This would also change the layout for all directories to include the
> new configuration of the cluster, so that newer data is placed on the
> correct bricks
>
> Step 2: Completion of background rebalance
> - Once background rebalance is complete, the rebalance status is noted as
> success/failure based on what the background rebalance process did
> - This will not stop the on access rebalance, as data is still all over the
> place, and enhancements like lookup-unhashed=auto will have trouble
>
> Step 3: Admin can initiate a full rebalance
> - When this is complete then the on access rebalance would be turned off, as
> the cluster is rebalanced!
>
> Step 2.5/4: Choosing to stop the on access rebalance
> - This can be initiated by the admin, either post step 3 (which is more
> logical) or between steps 2 and 3, in which case lookup-everywhere for
> files etc. cannot be avoided due to [2] above
>
> Issues and possible solutions:
>
> [4] One other thought is to create link files, as a part of [1], for files
> that do not belong to the right bricks but are _not_ going to be rebalanced,
> as their source/destination is not a changed brick. This _should_ be faster
> than moving the data around and rebalancing these files. It should also
> avoid the problem that, post a "rebalance _changed_" command, the cluster
> may have files in the wrong place based on the layout, as the link files
> would be present to correct the situation. With this in place the rebalance
> on access can be left on indefinitely, and turning it off does not serve
> much purpose.
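>
> As a minimal sketch of what [4] amounts to on a brick (assuming the usual
> DHT convention of an empty, sticky-bit-only file carrying a
> trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the
> data; error handling trimmed):
>
>     #include <fcntl.h>
>     #include <string.h>
>     #include <sys/stat.h>
>     #include <sys/xattr.h>
>     #include <unistd.h>
>
>     int create_link_file (const char *path, const char *data_subvol)
>     {
>             /* empty file with only the sticky bit set: DHT's marker
>              * for "the data lives elsewhere" */
>             int fd = open (path, O_CREAT | O_EXCL | O_WRONLY, 0);
>             if (fd < 0)
>                     return -1;
>             close (fd);
>             if (chmod (path, S_ISVTX) < 0)
>                     return -1;
>
>             /* point the link file at the subvolume holding the data */
>             return setxattr (path, "trusted.glusterfs.dht.linkto",
>                              data_subvol, strlen (data_subvol) + 1, 0);
>     }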
>
> Enabling rebalance on access always is fine, but I am not sure it buys us
> gluster states that mean the cluster is in a balanced situation, for other
> actions like the lookup-unhashed enhancement mentioned above, which may
> need more than just the link files in place. Examples could be mismatched
> or overly space-committed bricks with old, unaccessed data, etc., but I do
> not have a clear example yet.
>
> Just stating: the core intention of "rebalance _changed_" is to create
> space in existing bricks faster when the cluster grows, or to be able to
> remove bricks from the cluster faster.
>
> Redoing a "rebalance _changed_" due to a further gluster configuration
> change, i.e. expanding the cluster again, say, needs some thought. It does
> not matter whether rebalance on access is running or not; the only thing
> it may impact is the choice of files already put into the on-access queue
> based on the older layout, from the older cluster configuration. Just
> noting this here.
>
> In short, if we do [4] then we can leave rebalance on access turned on
> always, unless we have some counter-examples or use cases that are not yet
> thought of. Doing [4] seems logical, so I would state that we should; but
> from the angle of improving rebalance performance, we need to weigh its
> worth against the cost to the IO access paths of not having [4] (again,
> considering the improvement that lookup-unhashed brings, it is maybe
> obvious that [4] should be done).
>
> A note on [3]: the intention is to start an asynchronous sync task that
> rebalances the file on access, without impacting the IO path. So if the IO
> path identifies a file as needing a rebalance, a sync task is set up with
> the required xattr to trigger a file move, and setxattr is called; that
> should take care of the file migration while letting the IO path progress
> as is.
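>
> Something like the following, issued against the client mount, is the kind
> of trigger I have in mind (the exact trigger key is an assumption here;
> error handling trimmed):
>
>     #include <string.h>
>     #include <sys/xattr.h>
>
>     /* DHT intercepts this setxattr and migrates the file, so the
>      * caller does not block the normal IO path */
>     int trigger_migration (const char *path)
>     {
>             const char *key = "trusted.distribute.migrate-data"; /* assumed */
>             return setxattr (path, key, "force", strlen ("force"), 0);
>     }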
>
> Reading through your mail, a better way of doing this, sharing the load,
> would be to use an index, so that each node in the cluster has a list of
> accessed files that need a rebalance. The above method for [3] would be
> client heavy and would incur a network read and write, whereas doing
> things via an index on the node would allow local reads with remote
> writes, and spread the work. It would incur a walk/crawl of the index, but
> each entry returned is a candidate and the walk is bounded, so it should
> not be a bad thing by itself.
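>
> A rough sketch of the consuming side, assuming a hypothetical per-node
> index directory with one entry per accessed file that needs a rebalance
> (e.g. named by gfid):
>
>     #include <dirent.h>
>     #include <stdio.h>
>     #include <string.h>
>
>     void drain_rebalance_index (const char *index_dir)
>     {
>             DIR *dp = opendir (index_dir);
>             struct dirent *de;
>
>             if (!dp)
>                     return;
>             while ((de = readdir (dp)) != NULL) {
>                     if (!strcmp (de->d_name, ".") ||
>                         !strcmp (de->d_name, ".."))
>                             continue;
>                     /* each entry is a migration candidate: resolve it,
>                      * trigger the move, then drop the index entry */
>                     printf ("migrate candidate: %s\n", de->d_name);
>             }
>             closedir (dp);
>     }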
>
> >
> > I like all the proposed ideas. I think they would improve the performance
> > of the rebalance operation considerably. Probably we will need to define
> > some policies to limit the amount of bandwidth that rebalance is allowed
> > to use and at which hours, but this can be determined later.
>
> This section [5] of the feature page touches upon the same issue, i.e.
> being aware of IO path requirements and not letting rebalance hog the
> node's resources. But as you state, it needs more thought, and should
> probably be done once we see some improvements and also see that we are
> utilizing the resources heavily.
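>
> For illustration only, one simple shape such a policy could take is a
> token bucket capping the bytes/sec that rebalance may move, leaving
> headroom for the IO path (names and numbers are not from the feature
> page):
>
>     #include <stdint.h>
>     #include <time.h>
>
>     typedef struct {
>             uint64_t rate;    /* bytes per second allowed */
>             uint64_t tokens;  /* bytes currently available */
>             time_t   last;    /* initialize to time(NULL) */
>     } throttle_t;
>
>     /* returns nonzero if nbytes may be migrated now */
>     int throttle_allow (throttle_t *t, uint64_t nbytes)
>     {
>             time_t now = time (NULL);
>
>             t->tokens += (uint64_t)(now - t->last) * t->rate;
>             if (t->tokens > t->rate)  /* cap burst at one second */
>                     t->tokens = t->rate;
>             t->last = now;
>
>             if (t->tokens < nbytes)
>                     return 0;
>             t->tokens -= nbytes;
>             return 1;
>     }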
>
> >
> > I would also consider using the index or changelog xlators to track
> > renames and let rebalance consume them. Currently a file or directory
> > rename means that files correctly placed in the right brick need to be
> > moved to another brick. A full rebalance crawling all of the file system
> > seems too expensive for this kind of local change (the effects are orders
> > of magnitude smaller than adding or removing a brick). Having a way to
> > list pending moves due to renames without scanning all of the file system
> > would be great.
>
> Hmmm... to my knowledge, a rename of a file does not move the file; it
> rather creates a link file if the hashed subvolume of the new name is
> different from the older subvolume where the file was placed. The rename
> of a directory does not change its layout (unless a still-to-be-analyzed
> lookup races with the rename for layout fetching and healing). On any
> future layout fixes due to added or removed bricks, the layout overlaps
> are computed so as to minimize data movement.
>
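> To illustrate with a toy hash (the real one is Davies-Meyer, mapped onto
> the directory's layout ranges): only the *name* feeds the hash, so a
> rename may change the hashed subvolume while the data stays put, and only
> a link file is needed at the new hashed location:
>
>     #include <stdio.h>
>
>     #define NSUBVOLS 4
>
>     static unsigned int toy_hash (const char *name)
>     {
>             unsigned int h = 5381;
>             while (*name)
>                     h = h * 33 + (unsigned char) *name++;
>             return h;
>     }
>
>     int main (void)
>     {
>             /* data was placed by the old name's hash ... */
>             int cached = toy_hash ("report.old") % NSUBVOLS;
>             /* ... the new name may hash elsewhere: a link file is
>              * created there, the data itself is not moved */
>             int hashed = toy_hash ("report.new") % NSUBVOLS;
>
>             printf ("data on subvol %d, link file (if needed) on %d\n",
>                     cached, hashed);
>             return 0;
>     }
>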
> Are you suggesting a change in behavior here, or am I missing something?
>
> >
> > Another thing to consider for future versions is to modify the current
> > DHT to use consistent hashing, and even change the hash input (using the
> > gfid instead of a hash of the name would solve the rename problem).
> > Consistent hashing would drastically reduce the number of files that need
> > to be moved and already solves some of the current problems. This change
> > needs a lot of thinking though.
>
> Firstly, I agree that this is an area to explore and nail down better in
> the _hopefully_ near future, and that it takes some thinking time to get
> this straight while learning from the current implementation.
>
> Also, I would like to point out a commit that changes this for directories,
> using the GFID-based hash rather than the name-based hash, here [6].
I don't think this is what Xavi meant. This only changes how hash ranges are
distributed across subvolumes. To decide which subvolume a file goes to, we
still hash on the name. We cannot use the gfid, for the reasons I've pointed
out in another mail.
> It does
> not address the rename problem, but starts to do things that you put down
> here.
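>
> For reference, the lookup under the consistent hashing Xavi suggests would
> take roughly this shape (purely illustrative): subvolumes own points on a
> ring, a file maps to the first point at or after its hash, and adding a
> subvolume only moves the files that fall between the new point and its
> predecessor:
>
>     #include <stdint.h>
>
>     typedef struct {
>             uint32_t point;   /* position on the ring */
>             int      subvol;  /* owning subvolume */
>     } ring_entry_t;
>
>     /* entries must be sorted by point */
>     int ring_lookup (const ring_entry_t *ring, int n, uint32_t hash)
>     {
>             for (int i = 0; i < n; i++)
>                     if (hash <= ring[i].point)
>                             return ring[i].subvol;
>             return ring[0].subvol;  /* wrap around */
>     }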
>
> >
> > Xavi
> >
> >
>
> [5]
> http://www.gluster.org/community/documentation/index.php/Features/improve_rebalance_performance#Make_rebalance_aware_of_IO_path_requirements
> [6] http://review.gluster.org/#/c/7493/
>
> Shyam