[Gluster-devel] Feature review: Improved rebalance performance

Raghavendra G raghavendra at gluster.com
Thu Jul 3 11:07:53 UTC 2014


On Tue, Jul 1, 2014 at 2:23 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:
> > > Will this "rebalance on access" feature be enabled always or only
> > > during a brick addition/removal to move files that do not go to the
> > > affected brick while the main rebalance is populating or removing
> > > files from the brick?
> >
> > The rebalance on access, in my head, stands as follows (a little more
> > detailed than what is in the feature page):
> >
> > Step 1: Initiation of the process
> > - Admin chooses to "rebalance _changed_" bricks
> >   - This could mean added/removed/changed-size bricks
> > [3]- Rebalance on access is triggered, so as to move files when they are
> >      accessed, but asynchronously
> > [1]- Background rebalance acts only to (re)move data (from)to these bricks
> > [2]- This would also change the layout for all directories, to include the
> >      new configuration of the cluster, so that newer data is placed in the
> >      correct bricks
> >
> > Step 2: Completion of background rebalance
> > - Once background rebalance is complete, the rebalance status is noted as
> >   success/failure based on what the background rebalance process did
> > - This will not stop the on access rebalance, as data is still all over
> >   the place, and enhancements like lookup-unhashed=auto will have trouble
>
> I don't see why stopping rebalance on access is a problem when
> lookup-unhashed=auto is used. If I understand http://review.gluster.org/7702/
> correctly, when the directory commit hash does not match that of the volume
> root, a global lookup will be made. If we change the layout in [3], it will
> (or it should) also change the commit of the directory. This means that even
> if files of that directory are not rebalanced yet, they will be found
> regardless of whether on access rebalance is enabled or not.
>
> Am I missing something ?
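
To make that concrete, the decision being described could be sketched roughly
as below. This is only an illustration with made-up names, not the actual dht
code from http://review.gluster.org/7702/:

    # Illustrative sketch of the lookup-unhashed=auto idea: only when a
    # directory's commit hash disagrees with the volume's commit hash do we
    # fall back to a global (everywhere) lookup. Names are hypothetical.

    def lookup_file(volume_commit, dir_commit, hashed_subvol, all_subvols):
        """Return the subvolume(s) that must be queried to find a file."""
        if dir_commit == volume_commit:
            # Layout is up to date for this directory: the hashed subvolume
            # (or the link file found there) is authoritative.
            return [hashed_subvol]
        # Layout changed since this directory was last rebalanced: the file
        # may still live on its pre-rebalance subvolume, so look everywhere.
        return list(all_subvols)
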
>
> >
> > Step 3: Admin can initiate a full rebalance
> > - When this is complete, the on access rebalance would be turned off, as
> >   the cluster is rebalanced!
> >
> > Step 2.5/4: Choosing to stop the on access rebalance
> > - This can be initiated by the admin, post step 3 (which is more logical)
> >   or between steps 2 and 3, in which case lookup-everywhere for files etc.
> >   cannot be avoided due to [2] above
> >
>
> Having the possibility for admins to enable/disable this feature seems
> interesting. However, I also think it should be forcibly enabled when
> rebalancing _changed_ bricks.
>
> > Issues and possible solutions:
> >
> > [4] One other thought is to create link files, as a part of [1], for files
> > that do not belong to the right bricks but are _not_ going to be rebalanced
> > as their source/destination is not a changed brick. This _should_ be faster
> > than moving data around and rebalancing these files. It should also avoid
> > the problem that, post a "rebalance _changed_" command, the cluster may
> > have files in the wrong place based on the layout, as the link files would
> > be present to correct the situation. In this situation the rebalance on
> > access can be left on indefinitely, and turning it off does not serve much
> > purpose.
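
To illustrate why [4] is cheap: a dht link file is just an empty, sticky-bit
file on the hashed brick whose trusted.glusterfs.dht.linkto xattr names the
subvolume that actually holds the data. Below is a rough sketch of creating
one directly on a brick's backend, for illustration only; the real rebalance
does this through the translator stack, and the paths/names in the example are
hypothetical:

    # Sketch: create a dht-style link file on a brick's backend directory.
    # Requires root (trusted.* xattrs) and Linux. Illustration only.

    import os
    import stat

    def create_link_file(brick_path, cached_subvol_name):
        """Create an empty sticky-bit file pointing at the real location."""
        fd = os.open(brick_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o000)
        try:
            # ---------T : zero permissions plus the sticky bit marks a
            # dht link file.
            os.fchmod(fd, stat.S_ISVTX)
            # The linkto xattr records which subvolume holds the actual data.
            os.setxattr(fd, "trusted.glusterfs.dht.linkto",
                        cached_subvol_name.encode() + b"\x00")
        finally:
            os.close(fd)

    # Example (hypothetical brick path and subvolume name):
    # create_link_file("/bricks/b3/dir/file.txt", "testvol-client-1")
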
> >
>
> I think that creating link files is a cheap task, especially if rebalance
> will handle files in parallel. However, I'm not sure if this will make any
> measurable difference in performance on future accesses (in theory it should
> avoid a global lookup once). This would need to be tested to decide.
>
> > Enabling rebalance on access always is fine, but I am not sure it buys us
> > gluster states that mean the cluster is in a balanced situation, which
> > other actions like the lookup-unhashed change mentioned may need beyond
> > just having the link files in place. Examples could be mismatched or
> > overly space-committed bricks with old, not-accessed data, etc., but I do
> > not have a clear example yet.
> >
>
> As I see it, rebalance on access should be a complement to normal rebalance
> to keep the volume _more_ balanced (keep accessed files on the right brick
> to avoid unnecessary delays due to global lookups or link file redirections),
> but it cannot assure that the volume is fully rebalanced.
>
> > Just stating, the core intention of "rebalance _changed_" is to create
> > space in existing bricks faster when the cluster grows, or to be able to
> > remove bricks from the cluster faster.
> >
>
> That is a very important feature. I've missed it several times when expanding
> a volume. In fact we needed to write some scripts to do something similar
> before launching a full rebalance.
>
> > Redoing a "rebalance _changed_" again due to a gluster configuration
> > change, i.e. expanding the cluster again, say, needs some thought. It does
> > not matter whether rebalance on access is running or not; the only thing it
> > may impact is the choice of files that are already put into the on access
> > queue based on the older layout, due to the older cluster configuration.
> > Just noting this here.
> >
>
> This will need to be thought through more deeply, but if we only have a
> queue of files that *may* need migration, and we really check the target
> volume at the time of migration, I think this won't pose much of a problem
> in case of successive rebalances.
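
Something like the following check at dequeue time is what is being described;
an illustrative sketch with invented names, not existing rebalance code:

    # Sketch: before migrating a queued candidate, re-evaluate it against the
    # *current* layout, so entries queued under an older layout are simply
    # skipped. Names are illustrative.

    def process_candidate(path, current_layout, locate_file):
        """Migrate 'path' only if it is still misplaced under current_layout."""
        hashed = current_layout.hashed_subvol(path)   # where it should be now
        cached = locate_file(path)                    # where the data is now
        if hashed == cached:
            return None        # already in the right place; stale queue entry
        return (path, cached, hashed)  # migrate from cached to hashed
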
>
> > In short, if we do [4] then we can leave rebalance on access turned on
> > always, unless we have some other counter examples or use cases that are
> > not thought of. Doing [4] seems logical, so I would state that we should,
> > but from the performance angle of improving rebalance, we need to weigh
> > its worth against the cost to the IO access path of not having [4] (again,
> > considering the improvement that lookup-unhashed brings, it may be obvious
> > that [4] should be done).
> >
> > A note on [3]: the intention is to start an asynchronous synctask that
> > rebalances the file on access, and not impact the IO path. So if a file is
> > identified by the IO path as needing a rebalance, then a synctask is set up
> > that calls setxattr with the required xattr to trigger a file move; that
> > should take care of the file migration while enabling the IO path to
> > progress as is.
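
A minimal sketch of that idea: the fop path only enqueues the file, and a
background worker issues the migration-triggering setxattr, so the original
operation never waits on the move. The exact xattr key used to ask dht to
migrate a file is an assumption here (check the dht/rebalance sources); the
worker/queue structure is purely illustrative:

    # Sketch: decouple "file needs migration" from the IO path by handing
    # the setxattr trigger to a background worker. Illustration only; would
    # run against a glusterfs client mount. The xattr key is an assumption.

    import os
    import queue
    import threading

    MIGRATE_KEY = "trusted.distribute.migrate-data"   # assumed trigger key
    work = queue.Queue()

    def migration_worker():
        while True:
            path = work.get()
            try:
                # Ask dht to migrate the file; the worker blocks here, the
                # application IO path does not.
                os.setxattr(path, MIGRATE_KEY, b"force")
            except OSError:
                pass      # file went away, already migrated, etc.
            finally:
                work.task_done()

    threading.Thread(target=migration_worker, daemon=True).start()

    def on_access(path, is_misplaced):
        """Called from the IO path: just enqueue, never block on migration."""
        if is_misplaced:
            work.put(path)
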
> >
>
> Agreed. The file operation that triggered it must not be blocked while
> migration is performed.
>
> > Reading through your mail, a better way of doing this, by sharing the
> > load, would be to use an index, so that each node in the cluster has a
> > list of accessed files that need a rebalance. The above method for [3]
> > would be client heavy and would incur a network read and write, whereas
> > doing things via the index on the node could help with local reads and
> > remote writes and spread the work. It would incur a walk/crawl of the
> > index, but each entry returned is a candidate and the walk is limited, so
> > it should not be a bad thing by itself.
>
> The idea of using the index was more intended to easily detect renamed files
> on an otherwise balanced volume, and to be able to perform quick rebalance
> operations to move them to the correct brick without having to crawl the
> entire file system. In almost all cases, all files present in the index will
> need rebalance, so the cost of crawling the index is worth it.
>

We did consider using the index for identifying files that need migration. In
the normal case it suits our needs. However, after an add-brick we cannot
rely on the index to avoid a crawl, since the layout itself would have changed.
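
To make the index idea concrete: each brick could keep a small directory of
gfid-named entries for files that were renamed or found misplaced on access,
and a per-node daemon would walk only that directory instead of the whole
brick. The sketch below is a toy with an invented on-disk location, not the
actual index xlator format:

    # Toy sketch of a per-brick "needs rebalance" index: entries are empty
    # files named by gfid under an index directory, added on rename or
    # misplaced access and consumed by a local daemon. Invented layout.

    import os

    INDEX_DIR = ".glusterfs/indices/rebalance"   # hypothetical location

    def mark_needs_rebalance(brick_root, gfid):
        """Record that the file with this gfid should be examined."""
        idx = os.path.join(brick_root, INDEX_DIR)
        os.makedirs(idx, exist_ok=True)
        open(os.path.join(idx, gfid), "a").close()

    def pending_candidates(brick_root):
        """Walk only the index, not the whole brick."""
        idx = os.path.join(brick_root, INDEX_DIR)
        try:
            return os.listdir(idx)
        except FileNotFoundError:
            return []
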


>
> As I conceived it, it was independent of the on access rebalance. However, it
> could be seen as something similar to the self-heal daemon. We could consider
> that a file not residing on the right brick is not healthy and initiate some
> sort of self-heal on it. Not sure if this should/could be done in the
> self-heal daemon or would need another daemon though.
>
> Using the daemon solution, I think that the client side "on access rebalance"
> is not needed. However, I'm not sure which one is easier to implement.
>
> > > I like all the proposed ideas. I think they would improve the performance
> > > of the rebalance operation considerably. Probably we will need to define
> > > some policies to limit the amount of bandwidth that rebalance is allowed
> > > to use and at which hours, but this can be determined later.
> >
> > This [5] section of the feature page touches upon the same issue, i.e.
> > being aware of IO path requirements and not letting rebalance hog the node
> > resources. But as you state, it needs more thought and should probably be
> > done once we see some improvements and also see that we are utilizing the
> > resources heavily.
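
Whenever we get to it, the throttling policy itself can be quite simple, e.g.
a token bucket capping migration bandwidth, only armed during permitted hours.
The sketch below is generic and not tied to any existing gluster option:

    # Generic token-bucket sketch for capping rebalance bandwidth; not an
    # existing gluster option, just an illustration of the policy idea.

    import time

    class TokenBucket:
        def __init__(self, rate_bytes_per_sec, burst_bytes):
            self.rate = rate_bytes_per_sec
            self.capacity = burst_bytes
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def consume(self, nbytes):
            """Block until 'nbytes' of migration traffic is allowed."""
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                time.sleep((nbytes - self.tokens) / self.rate)

    # Example: limit migration to ~50 MB/s with a 4 MB burst.
    # bucket = TokenBucket(50 * 1024 * 1024, 4 * 1024 * 1024)
    # bucket.consume(len(chunk))  # call before writing each migrated chunk
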
> > > I would also consider using the index or changelog xlators to track
> > > renames and let rebalance consume that information. Currently a file or
> > > directory rename means that files correctly placed on the right brick
> > > need to be moved to another brick. A full rebalance crawling the whole
> > > file system seems too expensive for this kind of local change (the
> > > effects of this are orders of magnitude smaller than adding or removing
> > > a brick). Having a way to list pending moves due to renames without
> > > scanning the whole file system would be great.
> >
> > Hmmm... to my knowledge a rename of a file does not move the file; rather,
> > it creates a link file if the hashed subvolume of the new name is different
> > from the old subvolume where the file was placed. The rename of a
> > directory does not change its layout (unless a 'still to be analyzed'
> > lookup races with the rename for layout fetching and healing). On any
> > future layout fixes due to added or removed bricks, the layout overlaps
> > are computed so as to minimize data movement.
> >
> > Are you suggesting a change in behavior here, or am I missing something?
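
In code terms, the current rename behaviour as described above is roughly the
following decision; illustrative names only, not the actual dht_rename path:

    # Sketch of dht's rename behaviour: the data is not moved; if the new
    # name hashes to a different subvolume, a link file is left there
    # pointing back to where the data really lives. Illustrative only.

    def rename_placement(layout, new_name, cached_subvol, create_link_file):
        """Return the subvolume that receives a link file, if any."""
        new_hashed = layout.hashed_subvol(new_name)
        if new_hashed != cached_subvol:
            # Data stays put on cached_subvol; lookups of new_name land on
            # new_hashed and get redirected via the link file.
            create_link_file(new_hashed, new_name, points_to=cached_subvol)
            return new_hashed
        return None   # the new name hashes to where the data already is
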
>
> Not really. I'm only considering the possibility of adding an additional
> step. The way rename works is fine as it is. I think that creating a link
> file is the most efficient way to be able to easily find the file in the
> future without wasting too much bandwidth and IOPS. However, as more and
> more file and directory renames are made, more and more data is left on the
> wrong brick and each access needs an additional hop. Even if this were
> cheap, a future layout change trying to minimize data movement will not be
> optimal because data is not really where it thinks it is.
>
> Recording all renames in an index each time a rename is made can allow a
> background daemon to scan it and incrementally process them to restore
> volume balance.
>
> > > Another thing to consider for future versions is to modify the current
> > > DHT to use consistent hashing, and even change the hash value (using the
> > > gfid instead of a hash of the name would solve the rename problem).
> > > Consistent hashing would drastically reduce the number of files that
> > > need to be moved and already solves some of the current problems. This
> > > change needs a lot of thinking though.
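
A tiny sketch of both points: placing files on a hash ring keyed by gfid means
a rename does not change placement at all, and adding a brick only claims a
slice of the ring from its neighbours instead of reshuffling every range. This
is generic consistent hashing for illustration, not the current dht
range-based layout code:

    # Minimal consistent-hash ring keyed by gfid: renames don't move files
    # (the gfid doesn't change), and adding a brick steals keys only from
    # its ring neighbours. Generic illustration, not dht's layout.

    import bisect
    import hashlib

    class Ring:
        def __init__(self, bricks, vnodes=64):
            self.points = sorted(
                (self._h(f"{b}#{i}"), b) for b in bricks for i in range(vnodes))
            self.keys = [p for p, _ in self.points]

        @staticmethod
        def _h(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def locate(self, gfid):
            """Brick for this gfid; the file name plays no role, so renames
            never change placement."""
            i = bisect.bisect(self.keys, self._h(gfid)) % len(self.points)
            return self.points[i][1]

    # Adding a brick: only keys falling on the new brick's ring points move.
    # ring = Ring(["brick1", "brick2", "brick3"])
    # ring.locate("7f1c2a9e-0000-0000-0000-000000000001")
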
> >
> > Firstly, I agree that this is an area to explore and nail down better in
> > the _hopefully_ near future, and that it takes some thinking time to get
> > this straight, while learning from the current implementation.
> >
> > Also, I would like to point out a commit that changes this for directories,
> > using the GFID-based hash rather than the name-based hash, here [6]. It
> > does not address the rename problem, but starts to do the things that you
> > put down here.
>
> That's good. I missed this patch. I'll look at it. Thanks :)
>
> Xavi
>



-- 
Raghavendra G