[Gluster-devel] self-heal behavior
Gerry Reno
greno at verizon.net
Wed Jul 4 14:24:45 UTC 2007
Avati,
Comments inline...
Anand Avati wrote:
> Gerry,
> your question is appropriate, but the answer to 'when to resync' is
> not simple. When a brick which was brought down is brought up later,
> it may be a completely new (empty) brick. In that case, starting to
> sync every file would most likely be the wrong decision (we should
> rather sync the files the user needs than some unused files). Even if
> we chose to sync files without the user accessing them, it would be
> very sluggish, since it would interfere with other operations.
Self-heal should start syncing files immediately, not at full speed
but at some throttled "nice" level that would not impact operations.
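A throttled background resync of the kind suggested here could be built on a simple rate limiter. The sketch below is purely illustrative (it is not GlusterFS code, and all names are hypothetical): a token bucket caps the bytes per second a background heal may move, so client I/O keeps most of the bandwidth.

```python
# Hypothetical sketch, not GlusterFS code: a token-bucket limiter that a
# background self-heal loop could use to cap resync bandwidth so normal
# client I/O is not starved. Rates and names are illustrative.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Block until nbytes of budget is available, then spend it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

def throttled_copy(src, dst, bucket, chunk=64 * 1024):
    """Copy src to dst, pacing each chunk through the bucket."""
    total = 0
    while True:
        data = src.read(chunk)
        if not data:
            return total
        bucket.consume(len(data))
        dst.write(data)
        total += len(data)
```

Raising the bucket's rate when the system is otherwise idle would give the "throttle it up" behavior described below.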
>
> The current approach is to sync a file on the next open() of it. This
> is usually a good balance since, if we sync a file during open(), even
> a GB-sized file takes only 10-15 seconds, and for normal files (on the
> order of a few MB) it is almost not noticeable. But if this were to
> happen for all files at once, whether the user accessed them or not,
> there would be a lot of traffic and things would be very sluggish.
Again, this should be done at a throttled level while other operations
are happening; if there are none, throttle it up.
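The "sync on open()" behavior described above can be modeled in a few lines. This is a deliberately simplified sketch of the idea, not the real AFR implementation: each replica keeps a per-file change counter, and open() copies the freshest version onto any stale replica before returning. All class and function names here are assumptions for illustration.

```python
# Simplified model (NOT the real AFR code) of "sync on open()": each
# replica stores a change counter per file; opening a file first heals
# any replica whose counter lags behind the freshest copy.

class Replica:
    def __init__(self):
        self.files = {}  # path -> (version, content bytes)

    def write(self, path, data):
        ver = self.files.get(path, (0, b""))[0] + 1
        self.files[path] = (ver, data)

def afr_open(path, replicas):
    """Heal stale replicas of path, then return the healed content."""
    copies = [r.files.get(path, (0, b"")) for r in replicas]
    freshest_ver, freshest_data = max(copies, key=lambda c: c[0])
    for r in replicas:
        if r.files.get(path, (0, b""))[0] < freshest_ver:
            r.files[path] = (freshest_ver, freshest_data)  # self-heal
    return freshest_data
```

In this model a brick that missed a write stays stale until the file is next opened, which is exactly the lazy behavior Gerry observed in his test.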
>
> This approach of syncing on open() is what other filesystems that
> support redundancy do as well.
>
> Detecting 'idle time', beginning the sync-up, and pausing it when the
> user resumes activity is a very tricky job, but that is definitely
> what we aim at finally. It is not enough for AFR to detect that the
> client is free, because the servers may be busy serving files to
> another client, and syncing at that time may not be appropriate.
> Following versions of AFR will have more options to tune 'when' to
> sync. Currently it is only at open(). We plan to add an option to
> sync on lookup() (which happens on ls). Later versions would have
> pro-active syncing (detecting that both server and clients are idle,
> etc.).
That will be great.
Gerry
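The pro-active, pausable syncing discussed above could take roughly this shape. The sketch below is a hedged assumption about one possible design, not anything GlusterFS ships: a healer works through a queue of out-of-sync files only after a grace period with no client activity, and yields as soon as activity resumes.

```python
# Hypothetical sketch of idle-time pro-active healing (not a GlusterFS
# API): heal queued files only while no client operation has been seen
# for a grace period; yield immediately when activity resumes.
import time

class IdleHealer:
    def __init__(self, idle_grace=10.0):
        self.idle_grace = idle_grace          # seconds of quiet required
        self.last_activity = time.monotonic()
        self.queue = []                       # paths still needing resync

    def note_activity(self):
        """Called on every client operation (open, read, write, ...)."""
        self.last_activity = time.monotonic()

    def idle(self):
        return time.monotonic() - self.last_activity >= self.idle_grace

    def run_once(self, heal_fn):
        """Heal at most one queued file, and only if the system is idle."""
        if self.queue and self.idle():
            return heal_fn(self.queue.pop(0))
        return None
```

As Avati notes, the hard part is that idleness must hold for the servers too, not just this client, so a real design would need a cluster-wide notion of activity rather than the single local timestamp used here.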
>
> thanks,
> avati
>
> 2007/7/4, Gerry Reno <greno at verizon.net <mailto:greno at verizon.net>>:
>
> I've been doing some testing of self-heal: basically taking down one
> brick, copying some files to one of the client mounts, then bringing
> the downed brick back up. What I see is that when I bring the downed
> brick back up, no activity occurs. It's only when I start doing
> something in one of the client mounts that something happens to
> rebuild the out-of-sync brick. My concern is this: suppose I have
> four applications on different client nodes (separate machines) using
> the same data set (mounted on GlusterFS), and the brick on one of
> these nodes is out-of-sync. It is not until some user tries to use
> the application that the brick starts to resync. This results in
> sluggish performance for that user, as all the data has to be brought
> over the network from other bricks since the local brick is
> out-of-sync. There may have been ten minutes of idle time before this
> user tried to access the data, but glusterfs did not make any use of
> that time to rebuild the out-of-sync brick; it waited until a user
> tried to access the data. To me, it appears that glusterfs should
> make use of such opportunities, which would diminish the overall
> impact of the out-of-sync condition on users.
>
> Regards,
> Gerry
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org <mailto:Gluster-devel at nongnu.org>
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
>
> --
> Anand V. Avati