[Gluster-devel] Feature review: Improved rebalance performance

Shyamsundar Ranganathan srangana at redhat.com
Tue Jul 1 14:59:08 UTC 2014


> From: "Xavier Hernandez" <xhernandez at datalab.es>
> On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:
> > > Will this "rebalance on access" feature be enabled always or only during
> > > a brick addition/removal to move files that do not go to the affected
> > > brick while the main rebalance is populating or removing files from the
> > > brick?
> > 
> > The rebalance on access, in my head, stands as follows (a little more
> > detailed than what is in the feature page):
> > Step 1: Initiation of the process
> > - Admin chooses to "rebalance _changed_" bricks
> >   - This could mean added/removed/changed size bricks
> > [3]- Rebalance on access is triggered, so as to move files when they are
> >      accessed, but asynchronously
> > [1]- Background rebalance acts only to (re)move data (from)to these bricks
> > [2]- This would also change the layout for all directories, to include the
> >      new configuration of the cluster, so that newer data is placed in the
> >      correct bricks
> > 
> > Step 2: Completion of background rebalance
> > - Once background rebalance is complete, the rebalance status is noted as
> >   success/failure based on what the background rebalance process did
> > - This will not stop the on access rebalance, as data is still all over
> >   the place, and enhancements like lookup-unhashed=auto will have trouble
> 
> I don't see why stopping rebalance on access when lookup-unhashed=auto is a
> problem. If I understand http://review.gluster.org/7702/ correctly, when the
> directory commit hash does not match that of the volume root, a global lookup
> will be made. If we change layout in [3], it will also change (or it should)
> the commit of the directory. This means that even if files of that directory
> are not rebalanced yet, they will be found regardless of whether on access
> rebalance is enabled or not.
> 
> Am I missing something ?

The comment was more to state that the speedup gained by lookup-unhashed would be lost for the time that the cluster is not completely rebalanced, or has not noted all redirections as link files. The feature will work, but sub-optimally, and we need to consider/reduce the time for which this sub-optimal behavior is in effect.
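
To make the cost concrete, here is a toy sketch (my own illustration, not DHT code) of the lookup decision that lookup-unhashed=auto enables, assuming the commit-hash scheme from http://review.gluster.org/7702/. Subvolumes are plain Python dicts and all names are made up; the point is only that a stale directory commit hash forces the everywhere-lookup we would like to avoid.

    def dht_lookup(name, hashed_idx, subvols, dir_commit, vol_commit):
        """Return (subvol_index, entry) or None."""
        entry = subvols[hashed_idx].get(name)
        if entry is not None:
            return hashed_idx, entry  # fast path: hashed subvol has the file (or a link file)
        if dir_commit == vol_commit:
            return None               # directory known to be balanced: trust the miss
        # Commit hashes differ: the layout changed since the last rebalance,
        # so fall back to the expensive lookup on every subvolume.
        for idx, sv in enumerate(subvols):
            if name in sv:
                return idx, sv[name]
        return None

    # A file left on the "wrong" subvolume after a layout change:
    subvols = [{"a.txt": "data"}, {}]
    print(dht_lookup("a.txt", 1, subvols, dir_commit=7, vol_commit=9))  # global lookup finds it
    print(dht_lookup("a.txt", 1, subvols, dir_commit=9, vol_commit=9))  # the miss is trusted -> None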

> 
> > 
> > Step 3: Admin can initiate a full rebalance
> > - When this is complete then the on access rebalance would be turned off,
> >   as the cluster is rebalanced!
> > 
> > Step 2.5/4: Choosing to stop the on access rebalance
> > - This can be initiated by the admin, post step 3 (which is more logical)
> >   or between 2 and 3, in which case lookup everywhere for files etc.
> >   cannot be avoided due to [2] above
> > 
> 
> I like having the possibility for admins to enable/disable this feature.
> However I also think this should be forcibly enabled when rebalancing
> _changed_ bricks.

Yes, when rebalance _changed_ is in effect, the rebalance on access is also in effect, as noted in Step 1 of the elaboration above.

> 
> > Issues and possible solutions:
> > 
> > [4] One other thought is to create link files, as a part of [1], for files
> > that do not belong to the right bricks but are _not_ going to be rebalanced
> > as their source/destination is not a changed brick. This _should_ be faster
> > than moving data around and rebalancing these files. It should also avoid
> > the problem that, post a "rebalance _changed_" command, the cluster may
> > have files in the wrong place based on the layout, as the link files would
> > be present to correct the situation. In this situation the rebalance on
> > access can be left on indefinitely and turning it off does not serve much
> > purpose.
> > 
> 
> I think that creating link files is a cheap task, especially if rebalance will
> handle files in parallel. However I'm not sure if this will make any
> measurable difference in performance on future accesses (in theory it should
> avoid a global lookup once). This would need to be tested to decide.

It would also avoid the global lookup on creation of new files when lookup-unhashed=auto is in force, since a create can then check just the hashed subvol for the file's existence and report EEXIST errors (as needed).

For an existing file lookup, yes, the link file creation is triggered on the first lookup, which would do a global lookup, as opposed to the rebalance process ensuring these link files are present up front. Overall, my thought is that it is better to have the link files created, so that creates and lookups of existing files do not suffer the time and resource penalties.
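
A similar toy sketch (again mine, not DHT code) for the create path: if rebalance guarantees a link file on the hashed subvolume for every misplaced file, EEXIST can be decided from the hashed subvolume alone; otherwise the create has to look everywhere first.

    def dht_create(name, hashed_idx, subvols, link_files_guaranteed):
        hashed = subvols[hashed_idx]
        if name in hashed:
            raise FileExistsError(name)  # a real file or a link file sits on the hashed subvol
        if not link_files_guaranteed:
            # No guarantee of link files: look everywhere before creating,
            # or risk duplicating a misplaced file.
            for sv in subvols:
                if name in sv:
                    raise FileExistsError(name)
        hashed[name] = "new data"        # safe to create on the hashed subvolume
        return hashed_idx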

> 
> > Enabling rebalance on access always is fine, but I am not sure it buys us
> > a gluster state that means the cluster is balanced, for other actions like
> > the lookup-unhashed improvement mentioned, which may need more than just
> > the link files in place. Examples could be mismatched or overly
> > space-committed bricks with old, not-accessed data, etc., but I do not
> > have a clear example yet.
> > 
> 
> As I see it, rebalance on access should be a complement to normal rebalance
> to keep the volume _more_ balanced (keep accessed files on the right brick
> to avoid unnecessary delays due to global lookups or link file
> redirections), but it cannot assure that the volume is fully rebalanced.

True, except in the case where we ensure link files are created during rebalance _changed_. 

If we ensure the link files are all present, I think that exception holds at all times, except during the window before the link files are created, but this needs thorough validation before suggesting the same :)

> 
> > Just stating, the core intention of "rebalance _changed_" is to create
> > space in existing bricks when the cluster grows faster, or be able to
> > remove bricks from the cluster faster.
> > 
> 
> That is a very important feature. I've missed it several times when expanding
> a volume. In fact we needed to write some scripts to do something similar
> before launching a full rebalance.

The idea of rebalance _changed_ is to do just this: drain from the existing bricks the data that belongs to a newly added brick, leaving other data as is, with the additional step (as discussed in this mail) of adding link files for the non-migrated files. So a script could be a shorter-term solution, with the rebalance daemon handling the longer-term problem; again, it depends on the effort needed to get this going.
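
For what it is worth, such a short-term script would be roughly along the lines below: walk a FUSE mount and ask DHT to migrate each file via setxattr, per the xattr trigger mentioned later in this mail. This is an untested sketch; the mount path and the exact xattr name are assumptions on my part and need to be verified against the DHT sources.

    import os

    MOUNT = "/mnt/glustervol"                  # example FUSE mount of the volume
    MIGRATE_XATTR = "distribute.migrate-data"  # assumed trigger name; verify before use

    def trigger_migration(path):
        try:
            os.setxattr(path, MIGRATE_XATTR, b"force")
        except OSError as e:
            print("skipped %s: %s" % (path, e))

    for root, dirs, files in os.walk(MOUNT):
        for f in files:
            trigger_migration(os.path.join(root, f))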

> 
> > Redoing a "rebalance _changed_" again due to a gluster configuration
> > change, i.e. expanding the cluster again, say, needs some thought. It does
> > not impact whether rebalance on access is running or not; the only thing
> > it may impact is the choice of files that are already put into the on
> > access queue based on the older layout, due to the older cluster
> > configuration. Just noting this here.
> > 
> 
> This will need to be thought through more deeply, but if we only have a
> queue of files that *may* need migration, and we really check the target
> volume at the time of migration, I think this won't pose much of a problem
> in case of successive rebalances.

Yes, or create an index head node/directory based on rebalance IDs and trash older index heads as we progress. There could be other mechanisms as well.
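
A minimal sketch of the index-head idea, just to show the shape of it; the location, naming and on-disk format below are purely illustrative and not how the index xlator stores entries today.

    import os, shutil

    INDEX_BASE = "/var/lib/glusterd/rebalance-index"  # hypothetical location

    def start_rebalance(rebal_id):
        head = os.path.join(INDEX_BASE, "rebalance-%s" % rebal_id)
        os.makedirs(head, exist_ok=True)
        # Trash index heads from earlier rebalances; their candidates were
        # queued against an older layout and may no longer need migration.
        for d in os.listdir(INDEX_BASE):
            full = os.path.join(INDEX_BASE, d)
            if full != head:
                shutil.rmtree(full)
        return head

    def queue_candidate(head, gfid):
        # Record a file (by gfid) seen on the wrong subvolume, for the daemon to pick up.
        open(os.path.join(head, gfid), "a").close()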

> 
> > In short, if we do [4] then we can leave rebalance on access turned on
> > always, unless we have some other counter examples or use cases that are
> > not thought of. Doing [4] seems logical, so I would state that we should,
> > but from the performance angle of improving rebalance, we need to weigh
> > its worth against the IO access paths when [4] is not done (again,
> > considering the improvement that lookup-unhashed brings, it may be obvious
> > that [4] should be done).
> > 
> > A note on [3]: the intention is to start an asynchronous sync task that
> > rebalances the file on access, and not impact the IO path. So if a file is
> > chosen by the IO path as needing a rebalance, then a sync task with the
> > required xattr to trigger a file move is set up, and setxattr is called;
> > that should take care of the file migration while enabling the IO path to
> > progress as is.
> > 
> 
> Agreed. The file operation that triggered it must not be blocked while
> migration is performed.
> 
> > Reading through your mail, a better way of doing this by sharing the load
> > would be to use an index, so that each node in the cluster has a list of
> > accessed files that need a rebalance. The above method for [3] would be
> > client heavy and would incur a network read and write, whereas the index
> > manner of doing things on the node could help with local reads and remote
> > write operations and in spreading the work. It would incur a walk/crawl of
> > the index, but each entry returned is a candidate, and the walk is
> > limited, so it should not be a bad thing by itself.
> 
> The idea of using an index was more intended to easily detect renamed files
> on an otherwise balanced volume, and be able to perform quick rebalance
> operations to move them to the correct brick without having to crawl the
> entire file system. In almost all cases, all files present in the index
> will need rebalance, so the cost of crawling the index is worth it.
> 
> As I thought of it, it was independent of the on access rebalance. However, it
> could be seen as something similar to the self-heal daemon. We could consider
> that a file not residing in the right brick is not healthy and initiate some
> sort of self-heal on it. Not sure if this should/could be done in the self-
> heal daemon or would need another daemon though.
> 
> Using the daemon solution, I think that the client side "on access rebalance"
> is not needed. However I'm not sure which one is easier to implement.

Implementation wise, I am not yet clear either on which side is quicker to achieve. I thought the client side was easier, as I had that running in my head, till you and Raghavendra, in another conversation, brought up the index for different reasons, and my thought changed. Let's think this through and see what fits the bill (time and complexity wise, I mean).

> 
> > > I like all the proposed ideas. I think they would improve the performance
> > > of the rebalance operation considerably. Probably we will need to define
> > > some policies to limit the amount of bandwidth that rebalance is allowed
> > > to use and
> > > at which hours, but this can be determined later.
> > 
> > This [5] section of the feature page touches upon the same issue, i.e.
> > being aware of IO path requirements and not letting rebalance hog the node
> > resources. But as you state, it needs more thought, and should probably be
> > done once we see some improvements and also see that we are utilizing the
> > resources heavily.
> >
> > > I would also consider using index or changelog xlators to track renames
> > > and let rebalance consume them. Currently a file or directory rename
> > > means that files correctly placed in the right brick need to be moved to
> > > another brick. A full rebalance crawling all the file system seems too
> > > expensive for this kind of local change (the effects of this are orders
> > > of magnitude smaller than adding or removing a brick). Having a way to
> > > list pending moves due to renames without scanning all the file system
> > > would be great.
> > 
> > Hmmm... to my knowledge a rename of a file does not move the file; it
> > rather creates a link file if the hashed sub volume of the new name is
> > different from the older sub volume where the file was placed. The rename
> > of a directory does not change its layout (unless 'a still to be analyzed'
> > lookup races with the rename for layout fetching and healing). On any
> > future layout fixes due to added or removed bricks, the layout overlaps
> > are computed so as to minimize data movement.
> > 
> > Are you suggesting a change in behavior here, or am I missing something?
> 
> Not really. I'm only considering the possibility of adding an additional
> step. The way rename works now is fine as it is. I think that creating a
> link file is the most efficient way to be able to easily find the file in
> the future without wasting too much bandwidth and IOPS. However, as more
> and more file and directory renames are made, more and more data is left
> on the wrong brick and each access needs an additional jump. Even if this
> were cheap, a future layout change trying to minimize data movements will
> not be optimal because data is not really where it thinks it is.
> 
> Recording all renames in an index each time a rename is made can allow a
> background daemon to scan it and incrementally process them to restore volume
> balance.

Ok, understood. An additional, or possibly default, case for rebalance on access to handle.
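
A small sketch of what recording such a rename could look like (crc32 and the in-memory list are placeholders, not the DHT hash or the index xlator): only a rename whose new name hashes to a different subvolume than the one holding the data needs an entry for the daemon to consume.

    import zlib

    def hashed_subvol(name, nsubvols):
        return zlib.crc32(name.encode()) % nsubvols  # stand-in for the real DHT hash

    def on_rename(old_name, new_name, data_subvol, nsubvols, index):
        new_hashed = hashed_subvol(new_name, nsubvols)
        if new_hashed != data_subvol:
            # Data now lives on the "wrong" brick; queue it for the background daemon.
            index.append((new_name, data_subvol, new_hashed))

    index = []
    on_rename("a.txt", "b.txt", data_subvol=0, nsubvols=4, index=index)
    print(index)  # a queued entry if "b.txt" hashes away from subvol 0, else nothing to do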

> 
> > > Another thing to consider for future versions is to modify the current
> > > DHT to use consistent hashing, and even change the hash value (using the
> > > gfid instead of a hash of the name would solve the rename problem). The
> > > consistent hashing would drastically reduce the number of files that
> > > need to be moved and already solves some of the current problems. This
> > > change needs a lot of thinking though.
> > 
> > Firstly, I agree that this is an area to explore and nail down better in
> > the _hopefully_ near future, and that it takes some thinking time to get
> > this straight, while learning from the current implementation.
> > 
> > Also, I would like to point out a commit that changes this for
> > directories, using the GFID based hash rather than the name based hash,
> > here [6]. It does not address the rename problem, but starts to do things
> > that you put down here.
> 
> That's good. I missed this patch. I'll look at it. Thanks :)

As Raghavendra points out, this is not in reality a realization of what you are stating, but a change that chooses the GFID for directory layout assignment in the unnamed lookup case; it may still be interesting as a part of this conversation.
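
Still, on the gfid-versus-name point, here is a toy illustration (crc32 standing in for the real DHT hash) of why hashing the gfid would make placement immune to renames, while hashing the name does not: the gfid survives a rename, the name does not.

    import uuid, zlib

    def subvol_by_name(name, nsubvols):
        return zlib.crc32(name.encode()) % nsubvols  # placement follows the (renameable) name

    def subvol_by_gfid(gfid, nsubvols):
        return zlib.crc32(gfid.bytes) % nsubvols     # placement follows the immutable gfid

    gfid = uuid.uuid4()
    print(subvol_by_name("report.odt", 4), subvol_by_name("report-final.odt", 4))  # may differ
    print(subvol_by_gfid(gfid, 4), subvol_by_gfid(gfid, 4))                        # always equal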

> 
> Xavi

BTW, thanks for the comments and the discussion.

Shyam

