[Gluster-devel] Feature review: Improved rebalance performance

Shyamsundar Ranganathan srangana at redhat.com
Mon Jun 30 20:18:09 UTC 2014


> From: "Xavier Hernandez" <xhernandez at datalab.es>
> 
> Hi Shyam,
> 
> On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > It also touches upon a rebalance on access like mechanism where we could
> > potentially, move data out of existing bricks to a newer brick faster, in
> > the case of brick addition, and vice versa for brick removal, and heal the
> > rest of the data on access.
> > 
> Will this "rebalance on access" feature be enabled always or only during a
> brick addition/removal to move files that do not go to the affected brick
> while the main rebalance is populating or removing files from the brick ?

The rebalance on access, in my head, stands as follows (in a little more detail than what is in the feature page):
Step 1: Initiation of the process
- Admin chooses to "rebalance _changed_" bricks
  - This could mean added/removed/changed size bricks
[3]- Rebalance on access is triggered, so as to move files asynchronously when they are accessed
[1]- Background rebalance acts only to move data to, or remove data from, these bricks
  [2]- This would also change the layout for all directories to include the new configuration of the cluster, so that newer data is placed on the correct bricks

Step 2: Completion of background rebalance
- Once background rebalance is complete, the rebalance status is noted as success/failure based on what the background rebalance process did
- This will not stop the on access rebalance, as data is still all over the place and enhancements like lookup-unhashed=auto would have trouble otherwise

Step 3: Admin can initiate a full rebalance
- When this is complete, the on access rebalance would be turned off, as the cluster is rebalanced!

Step 2.5/4: Choosing to stop the on access rebalance
- This can be initiated by the admin, either after step 3 (which is more logical) or between steps 2 and 3, in which case lookup-everywhere for files etc. cannot be avoided, due to [2] above (a rough sketch of the overall flow is below)
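
Rough sketch of the flow, with hypothetical helper names (these are not actual gluster commands or internal APIs, just an illustration of the steps above):

    def rebalance_changed(cluster, changed_bricks):
        # Step 1: fix layouts everywhere so new data lands on the right
        # bricks, enable on-access migration, and move data only for the
        # changed bricks.
        cluster.fix_layout_all_directories()                 # [2]
        cluster.enable_on_access_rebalance()                 # [3]
        cluster.background_rebalance(only=changed_bricks)    # [1]

    def on_background_rebalance_done(cluster, status):
        # Step 2: record success/failure, but keep on-access rebalance
        # running, since files may still sit on their old bricks.
        cluster.record_rebalance_status(status)

    def full_rebalance(cluster):
        # Step 3 (and 2.5/4): a full rebalance; once it completes the
        # cluster is balanced and on-access rebalance can be turned off.
        cluster.background_rebalance(only=None)
        cluster.disable_on_access_rebalance()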

Issues and possible solutions:

[4] One other thought is to create link files, as part of [1], for files that do not belong on the right bricks but are _not_ going to be rebalanced because their source/destination is not a changed brick. This _should_ be faster than moving the data around and rebalancing those files. It should also avoid the problem that, after a "rebalance _changed_" command, the cluster may have files in the wrong place based on the layout, as the link files would be present to correct the situation. In that case the rebalance on access can be left on indefinitely, and turning it off does not serve much purpose. A sketch of the per-file decision is below.
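
Minimal sketch of the per-file decision during a "rebalance _changed_" run; hashed_subvol(), cached_subvol(), migrate_file() and create_link_file() are hypothetical stand-ins for DHT internals, not actual code:

    def handle_file(path, changed_bricks):
        hashed = hashed_subvol(path)   # where the new layout says the file belongs
        cached = cached_subvol(path)   # where the file data actually lives

        if hashed == cached:
            return                     # already in the right place

        if hashed in changed_bricks or cached in changed_bricks:
            # [1]: real data movement, only when a changed brick is involved
            migrate_file(path, cached, hashed)
        else:
            # [4]: leave a pointer instead of moving the data
            create_link_file(path, on=hashed, pointing_to=cached)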

Enabling rebalance on access always is fine, but I am not sure it buys us a gluster state in which the cluster is known to be balanced, which other actions (like the lookup-unhashed behaviour mentioned) may need beyond just having the link files in place. Examples could be mismatched or overly space-committed bricks with old, un-accessed data, etc., but I do not have a clear example yet.

Just stating: the core intention of "rebalance _changed_" is to create space in existing bricks faster when the cluster grows, or to be able to remove bricks from the cluster faster.

Redoing a "rebalance _changed_" again due to a gluster configuration change, i.e. expanding the cluster again, say, needs some thought. It does not matter whether rebalance on access is running or not; the only thing it may impact is the choice of files already put into the on-access queue based on the older layout, from the older cluster configuration. Just noting this here.

In short, if we do [4] then we can leave rebalance on access turned on always, unless there are counter-examples or use cases not yet thought of. Doing [4] seems logical, so I would state that we should, but from the angle of improving rebalance performance we need to weigh its cost against the IO-path cost of not having [4] (again, considering the improvement that lookup-unhashed brings, it may be obvious that [4] should be done).

A note on [3]: the intention is to start an asynchronous sync task that rebalances the file on access, without impacting the IO path. So if the IO path decides a file needs a rebalance, a sync task is set up with the required xattr to trigger the file move, and setxattr is called; that should take care of the file migration while the IO path progresses as is. A sketch of this is below.
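
Minimal sketch of [3], assuming the IO path can hand candidates to a background worker that triggers the migration through a setxattr; the xattr key name here is an assumption for illustration, not a confirmed interface:

    import os
    import queue
    import threading

    MIGRATE_XATTR = b"trusted.distribute.migrate-data"   # assumed trigger key
    candidates = queue.Queue()

    def io_path_hook(path):
        # Called from the access path; just queue the file and return, so
        # the original fop is not delayed by the migration.
        candidates.put(path)

    def migration_worker(mountpoint):
        while True:
            path = candidates.get()
            try:
                # The setxattr is what (conceptually) kicks off the file move.
                os.setxattr(os.path.join(mountpoint, path), MIGRATE_XATTR, b"force")
            except OSError:
                pass   # file deleted or already migrated; ignore in the sketch

    threading.Thread(target=migration_worker, args=("/mnt/gluster",), daemon=True).start()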

Reading through your mail, a better way of doing this, by sharing the load, would be to use an index, so that each node in the cluster has a list of accessed files that need a rebalance. The above method for [3] would be client heavy and would incur a network read and write, whereas doing it via an index on the node keeps the reads local (with remote writes) and spreads the work. It would incur a walk/crawl of the index, but each entry returned is a candidate and the walk is limited, so it should not be a bad thing by itself. A sketch of this variant follows.
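
Sketch of the index variant, assuming each node keeps a local index directory of accessed files that need a rebalance; the directory path and migrate_by_gfid() helper are made up for illustration:

    import os

    INDEX_DIR = "/bricks/brick1/.glusterfs/indices/rebalance-on-access"  # assumed path

    def record_candidate(gfid):
        # Server side: on access, drop a zero-byte marker named by gfid.
        open(os.path.join(INDEX_DIR, gfid), "a").close()

    def crawl_and_migrate():
        # Per-node worker: every entry in the index is a candidate, so the
        # crawl is bounded by the accessed files, not the whole filesystem.
        for gfid in os.listdir(INDEX_DIR):
            migrate_by_gfid(gfid)                      # hypothetical helper
            os.unlink(os.path.join(INDEX_DIR, gfid))   # done; remove the marker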

> 
> I like all the proposed ideas. I think they would improve the performance of
> the rebalance operation considerably. Probably we will need to define some
> policies to limit the amount of bandwidth that rebalance is allowed to use
> and at which hours, but this can be determined later.

This section [5] of the feature page touches upon the same issue, i.e. being aware of IO path requirements and not letting rebalance hog the node's resources. But, as you state, it needs more thought and should probably be done once we see some improvements and also see that we are utilizing the resources heavily. One possible shape for such a policy is sketched below.
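
Purely illustrative, not the feature page's design: a simple token-bucket throttle that the migration loop could consult before copying each chunk, so rebalance bandwidth is capped while the IO path stays responsive:

    import time

    class Throttle:
        def __init__(self, bytes_per_sec):
            self.rate = bytes_per_sec
            self.tokens = bytes_per_sec
            self.last = time.monotonic()

        def consume(self, nbytes):
            # Block until nbytes worth of tokens are available.
            while True:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                time.sleep((nbytes - self.tokens) / self.rate)

    # throttle = Throttle(50 * 1024 * 1024)   # cap rebalance at ~50 MB/s
    # throttle.consume(len(chunk))            # before writing each migrated chunk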

> 
> I would also consider using index or changelog xlators to track renames and
> let rebalance consume it. Currently a file or directory rename makes that
> files correctly placed in the right brick need to be moved to another brick.
> A full rebalance crawling all the file system seems too expensive for this
> kind of local changes (the effects of this are orders of magnitude smaller
> than adding or removing a brick). Having a way to list pending moves due to
> rename without scanning all the file system would be great.

Hmmm... to my knowledge a rename of a file does not move the file; rather, it creates a link file if the hashed subvolume of the new name is different from the older subvolume where the file was placed (sketched below). The rename of a directory does not change its layout (unless 'a still to be analyzed' lookup races with the rename for layout fetching and healing). On any future layout fixes due to added or removed bricks, the layout overlaps are computed so as to minimize data movement.
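
Sketch of that rename behaviour, with hashed_subvol_for_name(), cached_subvol(), do_rename_on() and create_link_file() as hypothetical stand-ins for what DHT does internally:

    def rename(old_path, new_path):
        new_hashed = hashed_subvol_for_name(new_path)
        cached = cached_subvol(old_path)          # where the data actually is

        do_rename_on(cached, old_path, new_path)  # the data itself stays put

        if new_hashed != cached:
            # Lookups on the new name go to new_hashed first, so leave a
            # link file there that redirects to the cached subvolume.
            create_link_file(new_path, on=new_hashed, pointing_to=cached)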

Are you suggesting a change in behavior here, or am I missing something?

> 
> Another thing to consider for future versions is to modify the current DHT to
> a consistent hashing and even the hash value (using gfid instead of a hash of
> the name would solve the rename problem). The consistent hashing would
> drastically reduce the number of files that need to be moved and already
> solves some of the current problems. This change needs a lot of thinking
> though.

Firstly, I agree that this is an area to explore and nail down better in the _hopefully_ near future, and that it takes some thinking time to get this straight while learning from the current implementation.

Also, I would like to point to a commit that changes this for directories, using the GFID-based hash rather than the name-based hash, here [6]. It does not address the rename problem, but it starts to do the things you put down here. A toy illustration of the difference is below.
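
Toy comparison of name-based versus GFID-based placement (the hash below is a stand-in, not DHT's actual hash function):

    import hashlib

    def subvol_by_name(name, subvols):
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return subvols[h % len(subvols)]

    def subvol_by_gfid(gfid, subvols):
        h = int(hashlib.md5(gfid.encode()).hexdigest(), 16)
        return subvols[h % len(subvols)]

    subvols = ["subvol-0", "subvol-1", "subvol-2"]
    gfid = "3f2b9c2e-0d5a-4a7e-9f1c-8a2d4e6b7c01"

    # A rename can change the name-based placement (possibly a different
    # subvolume), but the gfid-based placement stays the same.
    print(subvol_by_name("old-name", subvols), subvol_by_name("new-name", subvols))
    print(subvol_by_gfid(gfid, subvols), subvol_by_gfid(gfid, subvols))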

> 
> Xavi
> 
> 

[5] http://www.gluster.org/community/documentation/index.php/Features/improve_rebalance_performance#Make_rebalance_aware_of_IO_path_requirements
[6] http://review.gluster.org/#/c/7493/

Shyam

