[Gluster-devel] Re: Load balancing ...

Thu May 1 17:38:57 UTC 2008

On Thu, 1 May 2008, Martin Fick wrote:

>> Sounds like a good idea. The next question is where
>> to keep the log. 1 log per file? 1 log per
>> directory?
>> How to store them? Shadow files? Separate
>> shadow volume? A shadow volume might be a good idea
>> because it keeps the main source mounted directory
>> exactly the same as a normal directory.
>
> I would start as simple as possible and adapt as
> necessary if you run into a performance problem.  The
> simplest design would probably be a shadow volume with
> one log per file with a sparse mirrored directory
> structure.

Indeed, that's exactly what I was thinking. You would effectively need a 
container, like a namespace, to unify the two.

> Logs could be 24(?) bytes concatenated one
> after another making appending easy and reliable.  Or
> at a minor space cost (but potential added
> portability/extendability), each log file could even
> be a colon delimited line based ascii file (please
> don't anyone suggest an xml file!)
>
>  version1:start1:span1
>  version2:start2:span2
>  ...

If it's fixed length pointers (or in fact fixed length records), I'd go 
with packed binary format for efficiency and speed. These will have to be 
written to on every write. There would also need to be a header that 
states where the roll-over point is. Effectively, the log would be an RRD.
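
A rough sketch of what such a file could look like, in Python for
brevity. The layout is purely my assumption for illustration (a 16-byte
header holding a magic string, the capacity in records and a running
slot counter, followed by 24-byte version/start/span records); none of
these names exist in GlusterFS.

```python
import struct

# Assumed layout, for illustration only:
# header = magic (4 bytes), capacity in records (4), slots written so far (8)
HEADER = struct.Struct("<4sIQ")   # 16 bytes, no padding
# record = version, start offset, span length: 3 x 8 = 24 bytes
RECORD = struct.Struct("<QQQ")

def create_log(path, capacity):
    """Create an empty, preallocated ring log."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(b"JLOG", capacity, 0))
        f.write(b"\0" * (capacity * RECORD.size))

def append_entry(path, version, start, span):
    """Write one record at the roll-over point, then advance the header."""
    with open(path, "r+b") as f:
        magic, capacity, written = HEADER.unpack(f.read(HEADER.size))
        f.seek(HEADER.size + (written % capacity) * RECORD.size)
        f.write(RECORD.pack(version, start, span))
        f.seek(0)
        f.write(HEADER.pack(magic, capacity, written + 1))

def read_entries(path):
    """Return the surviving records, oldest first."""
    with open(path, "rb") as f:
        _magic, capacity, written = HEADER.unpack(f.read(HEADER.size))
        body = f.read(capacity * RECORD.size)
    count = min(written, capacity)
    first = written - count
    return [RECORD.unpack_from(body, ((first + i) % capacity) * RECORD.size)
            for i in range(count)]
```

The file never grows on append: once the counter passes the capacity,
the oldest record is simply overwritten. A real implementation would
also have to worry about crash safety between the two writes.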

> Having a separate log file for each real file also
> makes it easy to code up some optimizations, for
> example: it would be easy to lookup the size of the
> log and the size of the real file.  As soon as the log
> becomes bigger than the real file it is no longer
> worth keeping as is!  It also makes it real easy to
> just delete the log if the real file is deleted.

Maybe have the default log be about 0.5% of the file in powers of 2, and
not used for files below a certain size. Maybe grow/shrink it when it
would exceed one step in powers of 2 from its intended 0.5% size. This
would mean that as the file grows, the log increases, but the log
extension gets exponentially more rare. Log truncation could be left until
a suitable roll-over point. If you are syncing inodes, then that is
typically 4KB, and a log entry would be, as you said, 24 bytes. That makes
a log entry for a changed inode block about 1/170, which is about right
for the 0.5% ball park.
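
To make the sizing rule concrete, here is a small sketch. The 1 MiB
cut-off below which no log is kept is a number I made up for the
example, and the function names are hypothetical.

```python
ENTRY_SIZE = 24          # bytes per log entry, as discussed above
BLOCK_SIZE = 4096        # bytes per synced block
MIN_FILE_SIZE = 1 << 20  # assumed cut-off: no log for files under 1 MiB

def target_log_size(file_size):
    """0.5% of the file, rounded up to a power of 2 (0 for small files)."""
    if file_size < MIN_FILE_SIZE:
        return 0
    want = file_size // 200              # 0.5%
    return 1 << (want - 1).bit_length()  # next power of 2 >= want

def next_log_size(current, file_size):
    """Resize lazily: only when more than one power-of-2 step off target."""
    target = target_log_size(file_size)
    if target == 0 or current == 0:
        return target
    if target > 2 * current or 2 * target < current:
        return target
    return current
```

With these numbers a 1 MiB file gets an 8 KiB log, and that log is left
alone until the file has grown to the point where the target is 32 KiB.
The ratio BLOCK_SIZE / ENTRY_SIZE comes out at roughly 170.7, which is
where the 1/170 figure comes from.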

> Another nice optimizer could make intelligent
> decisions about which log files to delete when the
> shadow volume starts to fill up.  By simply examining
> the size of each log versus the size of the real file
> one can set an upper bounds on how much transfer data
> the log could be saving (a real estimate would require
> adding all the spans together in the log file taking
> into account overlapping sections).

Sure - if you want to keep volumes separate. Or you could just make sure
that your log volume is always at least 1/170 of the data volume it's 
shadowing. Possibly a bit more for a safety margin with the lazy log 
resizing - around 1% ought to suffice for most sane cases.
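
For what it's worth, the "real estimate" mentioned in the quoted
paragraph (adding the spans together while accounting for overlaps) is
cheap to compute. A sketch, assuming the entries are already available
as (start, span) pairs; the function names are mine:

```python
def dirty_bytes(entries):
    """Total bytes covered by (start, span) pairs, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, span in sorted(entries):
        end = start + span
        if cur_end is None or start > cur_end:
            # Gap before this span: close the previous merged range.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged range.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def worth_keeping(entries, file_size):
    """A log only helps if replaying it moves less data than a full copy."""
    return dirty_bytes(entries) < file_size
```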

> Finally, it would
> allow an admin to prune the shadow volume manually of
> whichever logs he chooses to prune.  An ascii file
> would make it easy to script various pruners.

I think that starts getting potentially dangerous. I think just having the 
logs volume at about 1% of the data volume would be better. Of course, if 
you keep both on the same physical volume, it won't matter.

> It would be nice to design the shadow volume so that
> it can be removed from the picture at any time without
> corrupting anything.

You already covered that with the sparse shadow volume tree. If there's no 
log, you resync the whole file.
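
In other words, the fallback is trivial. A minimal sketch of that
decision, assuming the log has already been read into a list of
(start, span) pairs and that None means there is no log at all; the
function is hypothetical and works on any seekable file objects:

```python
def resync(src, dst, size, entries=None):
    """Bring dst up to date from src; returns the number of bytes copied.

    entries is a list of (start, span) pairs from the log, or None when
    the log is missing, in which case the whole file is copied.
    """
    ranges = [(0, size)] if entries is None else entries
    copied = 0
    for start, span in ranges:
        src.seek(start)
        dst.seek(start)
        data = src.read(span)
        dst.write(data)
        copied += len(data)
    return copied
```

Losing the shadow volume therefore only costs bandwidth, never
correctness.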

> It would also be nice to ensure
> that the journal translator can handle an out of space
> condition.  This way each server is not required to
> even have the same size journal volume if any at all.

Note that this gets into the chicken and the egg problem - the log files 
would still need to be syncable directly using the current method - or 
you'd need a journal for your journalling volume. But if the journal is 
typically < 1% of the file, that's probably cheap enough that it won't 
matter too much. You could also probably set the upper limit on the volume 
size, because past a certain point the file changes will be limited by 
disk speeds, so from there on a bigger file doesn't imply more log space 
is required.

>> A (shadow volume) log should, ideally, also keep
>> additional sanity check information such as file
>> metadata (timestamps, size) for cross-check of
>> whether something went weird and the file was
>> changed underneath GlusterFS, and if it has, flush
>> out the log and force a full resync on the file.
>
> Hmm, this seems like an additional layer that might be
> nice (and perhaps an XML log would be appropriate
> here), but I would put it in a separate inline
> translator so that it is not required.  The nice part
> is that if the protocol is extended to handle the
> journal layer, adding another separate layer like this
> would probably be easy!

For the sake of an extra few bytes in the log entry (8 byte time stamp + 8 
byte file size), I think it is probably worthwhile having it for 
crosscheck.
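
Something along these lines, as a sketch only. The 40-byte record
layout and the function names are my assumptions, and I am using the
modification time in nanoseconds as the 8-byte time stamp:

```python
import os
import struct

# Assumed record: version, start, span, mtime (ns), file size = 40 bytes
RECORD = struct.Struct("<QQQqQ")

def make_entry(path, version, start, span):
    """Pack a log entry stamped with the file's current mtime and size."""
    st = os.stat(path)
    return RECORD.pack(version, start, span, st.st_mtime_ns, st.st_size)

def log_is_trustworthy(path, last_entry):
    """False if the file no longer matches the last stamped entry."""
    _version, _start, _span, mtime_ns, size = RECORD.unpack(last_entry)
    st = os.stat(path)
    return st.st_mtime_ns == mtime_ns and st.st_size == size
```

If the check fails, the file was changed behind the log's back, so the
log gets flushed and the file is resynced in full.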

> Thanks again for your patience, I know it's not easy
> listening to back seat designers :)

I second that apology. :-)

Gordan