[Gluster-devel] Re: Load balancing ...

Martin Fick mogulguy at yahoo.com
Thu May 1 04:18:53 UTC 2008


--- Gordan Bobic <gordan at bobich.net> wrote:
> Martin Fick wrote:
> 
> >> A better solution would be to maintain a list of
> >> dirty blocks and use it during selfheal.
> > 
> > Agreed, but why not make it infinitely granular and
> > keep a list of dirty file spans instead of blocks?
> 
> > This should be extremely space efficient.
> 
> Is this complication and extra effort really worth
> the benefit over straight rolling hash rsync
> approach? It seems to me that applying the 
> rsync method at read-time would be a fairly minor
> mod that would solve 99% of the problem. No extra
> book-keeping would be required, only a 
> change from copying the whole file to rsyncing the
> file.

If the rsync solution truly is minor, great, I am all
for it!  However, I do not share your optimism that the
rsync method is minor; I think the journal method is
comparatively much simpler, less error prone, and more
efficient to boot.

There are two parts common to both solutions, and I
will claim that both parts are easier with the journal
method.  The journal method does have an additional
third part, the actual journal logging (and cleanup),
which I believe is actually very simple.  The two parts
in common are: 1) determining which parts of files need
to be transferred, and 2) the protocol extensions to
communicate/transfer those parts.

For #1, in the journal case you simply consult the
journal and you have the answer; extremely simple!  In
the rsync case you must calculate it.  I do not believe
that the rsync algorithm could be characterized as
simple, but I will grant you that at least there is
existing code out there that could be used to do it.
Even with that head start, porting the existing code to
glusterfs surely couldn't be easier than looking up
extents in a file?
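
To make that concrete, here is a minimal sketch, in C,
of what "simply consult the journal" could look like.
The record layout and the plan_selfheal() name are
purely hypothetical illustrations, not existing
glusterfs code:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical on-disk record: one entry per changed span. */
    struct dirty_span {
        uint64_t offset;   /* start of the changed region          */
        uint64_t length;   /* number of bytes changed              */
        uint64_t version;  /* file version this change belongs to  */
    };

    /*
     * Decide what to send to a stale replica: every span whose
     * version is newer than the last version the replica saw.
     * No hashing and no reading of file data, just the journal.
     */
    static void plan_selfheal(const struct dirty_span *spans, size_t nspans,
                              uint64_t replica_version)
    {
        for (size_t i = 0; i < nspans; i++) {
            if (spans[i].version > replica_version)
                printf("send bytes [%llu, %llu)\n",
                       (unsigned long long)spans[i].offset,
                       (unsigned long long)(spans[i].offset + spans[i].length));
        }
    }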

As for #2, a common protocol might even serve both
approaches, leaving the door open for either solution
in the future!  This part should be about the same
complexity for both solutions, and certainly no more
complex for the journal method.
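
As a rough illustration of such a shared piece, the
transfer request could be as small as a file handle
plus a byte range; whether the range came out of a
journal or out of an rsync-style comparison would not
matter to the receiver.  The layout below is
hypothetical, not an existing glusterfs wire format:

    #include <stdint.h>

    /* Hypothetical wire message: "please accept/request these
     * bytes of this file".  Both the journal and an rsync-style
     * scanner could produce the same (offset, length) pairs. */
    struct heal_range_msg {
        uint64_t file_id;   /* server-side handle for the file          */
        uint64_t offset;    /* start of the byte range                  */
        uint64_t length;    /* size of the byte range                   */
        uint64_t version;   /* version the data brings the span up to   */
    };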


As for performance, I do not believe that it is even
close.  Part 2 should be the same for both.  For part 1
the biggest benefit to both methods is achieved on
large files.  On large files, as Garreth has pointed
out, the rsync method would mean heavy disk I/O and CPU
usage on both servers, plus a decent amount of network
I/O comparing hashes.  The journal method, however,
would require no CPU, no network I/O (which is likely
to be much more scarce than disk I/O), and fairly minor
disk I/O on only one server, reading a list of dirty
spans/extents from a file (this can be optimized and
limited so that it is always less than rsync, which
must read the whole file).  Unless I am missing
something, the journal method blows away the
performance of the rsync method in the big-file case,
and in every case (including the small-file case) it is
still faster and less resource intensive.
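
To put rough, illustrative numbers on that (assumptions,
not measurements): suppose a 1 GB file had 64 KB
modified while one replica was down.  The rsync
approach would read the full 1 GB on both servers to
compute block checksums and exchange on the order of
hundreds of kilobytes of hashes before sending the
64 KB.  The journal approach would read a handful of
24-byte span records on one server and then send the
same 64 KB, with no hashing at all.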

This leaves the additional logging/cleanup task of the
journaling method.  I address your specific logging
performance concerns further down in this message.  I
believe that logging will have a negligible, if even
measurable, impact on performance if done right, since
there are many performance enhancements that could be
made to it.  For example: only log changes to files
above a certain size, or perform asynchronous logging
so that it does not impact the normal write path.  The
disk I/O for each change is independent of the change
size and is probably only around 24 bytes.  If changes
are logged without syncing to disk (since losing the
log on failure is acceptable), many seeks can be
avoided.  Logging could even be memory cached for a
while so that entries could potentially be deleted
before ever being written to disk (see cleanup below),
avoiding any disk I/O!
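
Here is a minimal sketch of what I mean by
asynchronous, unsynced, memory-cached logging.  The
names, the fixed-size in-memory table, and the flush
policy are all hypothetical illustrations of the idea,
not proposed glusterfs code:

    #include <stdint.h>
    #include <stdio.h>

    /* The same 24-byte record as before. */
    struct dirty_span { uint64_t offset, length, version; };

    #define LOG_CACHE_SLOTS 1024

    static struct dirty_span log_cache[LOG_CACHE_SLOTS];
    static size_t            log_count;

    /*
     * Called from the write path: O(1), no disk access, no fsync.
     * Overlapping or adjacent spans of the same version are merged,
     * so a streaming write collapses into a single 24-byte record.
     */
    static void journal_note_write(uint64_t off, uint64_t len, uint64_t ver)
    {
        if (log_count > 0) {
            struct dirty_span *last = &log_cache[log_count - 1];
            if (last->version == ver &&
                off >= last->offset &&
                off <= last->offset + last->length) {
                if (off + len > last->offset + last->length)
                    last->length = off + len - last->offset;
                return;
            }
        }
        if (log_count < LOG_CACHE_SLOTS)
            log_cache[log_count++] = (struct dirty_span){ off, len, ver };
        /* else: flush the cache to the on-disk journal (not shown) */
    }

    /*
     * Run off the write path (e.g. from a flush thread): append the
     * cached records, 24 bytes each, without any fsync.
     */
    static void journal_flush(FILE *journal)
    {
        fwrite(log_cache, sizeof(log_cache[0]), log_count, journal);
        log_count = 0;
    }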


> Journal per se wouldn't work, because that implies
> fixed size and write-ahead logging. What would be
> required here is more like the snapshot style undo
> logging.

No, no need to keep the old data around.  We only need
to remember the start and span of each changed section,
along with the file version of the change!  This is
much easier and more space efficient than snapshots.
Excuse me for being ignorant of the actual sizes of
these three parameters, but they can't be larger than
8 bytes each, can they?  8*3 = 24 bytes per record, so
a 100MB journal area could hold over four million
change records!

> The problem with this is that you have to:
> 
> 1) Categorically establish whether each server is
> connected and up to date for the file being
> checked, and only log if the server has 
> disconnected. This involves overhead.

Again, no, there is no need to add overhead to the
logging; leave that to the cleanup path.  The journal
translator (perhaps journal is not the best word, but
until a better one is suggested...) can be invisible
to the AFR layer during logging.  The journal layer
can simply log every byte range that is changed, along
with the version, without knowing whether any servers
are down.  As shown above, the disk overhead is minor,
or can be optimized to be minor, and that is a journal
layer decision, not an AFR layer decision.  This means
that a client-side AFR could literally see no overhead
from the logging (but potentially some minor overhead
for cleanup).

As for cleaning up unused versions, this can happen in
several efficient ways.  The AFR translator could
inform the journal of previously successful writes on
other nodes after the fact, in the next (potentially
unrelated) message packet (again, we are talking a few
bytes here), so that the record can be quickly dropped
(potentially before it was ever even written to disk!).
Cleaning up logs that are no longer needed because of a
heal can be done during the healing itself.  Again,
this means a few extra bytes in already required
message packets.
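
In other words, the cleanup traffic could be nothing
more than a tiny acknowledgement tacked onto a reply
that is going out anyway, for example something like
this (again a hypothetical layout, not an existing
glusterfs structure):

    #include <stdint.h>

    /* Hypothetical ack appended to an already outgoing reply:
     * "file F is now good up to version V on replica R".  On
     * receipt, the journal can drop every record for F whose
     * version is <= confirmed_version, possibly before the
     * record was ever flushed to disk. */
    struct journal_ack {
        uint64_t file_id;            /* which file               */
        uint64_t confirmed_version;  /* highest version applied  */
        uint32_t replica_id;         /* which replica confirmed  */
    };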

All this keeps the logging and cleanup network
overhead near zero: effectively a few more bytes
piggybacking on already existing message packets.


> 2) For each server that is down at the time, each
> other server would have to start writing the
> snapshot style undo logs (which would have to be 
> per server) for all the files being changed. This
> effectively multiplies the disk write-traffic by
> the number of offline servers on all the working 
> up to date servers.

No need for snapshot logging (see above), so the
small amount of writing needs to occur only once on
each up server, not once for each down server.  There
are no multiplying/scaling issues here.


> The problem that arises then is that the fast(er)
> resyncs on small changes come at the cost of
> massive slowdown in operation when you have
> multiple downed servers. As the number of servers
> grows, this rapidly stops being a workable
> solution.

No snapshot assumption, no massive slowdown.  One
write per up server no matter how many down servers. 
This scales nicely since the writes are all on
separate servers.


I realize I may not convince you of all of this, and
that you guys have probably spent a lot of time
thinking about this and that there are surely other
issues which I have not thought of.  Are there any
other known/perceived issues?

In spite of my pigheadedness and refusal to drop the
issue easily, I appreciate that you are taking the
time to discuss potential problems with what I believe
would be a good solution.  As a point of reference,
surely since other projects
such as DRBD have implemented similar logging
solutions (and not the rsync solution), they at least
must believe it to be superior. :)  Although, I would
argue that DRBD could easily be modified, and would
greatly benefit from using rsync itself when it needs
to do a full sync!  Perhaps I will even suggest that
on the DRBD list. :)

Thanks,

-Martin

P.S. Since I believe the logging part (without the
cleanup or re-sync modifications) will have negligible
impact, and since it is actually the easiest part to
implement, it could easily be prototyped to determine
its actual impact!  Would you be convinced if those
numbers turned out to be negligible? :)


