[Gluster-devel] Choice of Translator question

Kevan Benson kbenson at a-1networks.com
Thu Dec 27 17:58:13 UTC 2007


Gareth Bult wrote:
>> The trusted.afr.version extended attribute tracks which file
>> version is in use, and on a read, all participating AFR members
>> should respond with this information; any older/obsoleted file
>> versions are replaced by a newer copy from one of the valid AFR
>> members (this is self-heal).
> 
> Yes, understood.
> 
>> I think they are planning striped reads per block (maybe definable)
>> at a later date.
> 
> Mmm, so at the moment, when it says AFR does striped reads, what it
> really means is that it does striped reads, just so long as you have
> lots of relatively small files and not a few large files .. ???

I'm not sure.  It could very well depend on which version you are using, 
and where you read that.  I'm sure some features listed in the wiki are 
only implemented in the TLA releases until they put out the next point 
release.

>> Read the file from a client (head -c1 FILE >/dev/null to force
>> the read)
> 
> OR find /mountedfs -exec head -c1 > /dev/null {} \;
> 
> .. which is good, but VERY inefficient for a large file-system.

Agreed, which is why I only showed the single-file self-heal method; in 
your case a targeted self-heal (perhaps before a full filesystem 
self-heal) might be more useful.
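
To make "targeted" concrete, here's a rough sketch of what I mean.  The 
paths and filenames are just examples, and you'd need the attr tools 
(getfattr) on the servers to inspect the xattr:

  # On each server, see which backend copies are stale (example path):
  getfattr -n trusted.afr.version /export/brick/database

  # On the client, force a self-heal on just the files you care about,
  # rather than walking the whole mounted filesystem:
  for f in /mnt/glusterfs/database /mnt/glusterfs/database.index; do
      head -c1 "$f" > /dev/null
  done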

>> you could use the stripe translator over AFR to AFR chunks of the
>> DB file, thus allowing per chunk self-heal.
> 
> Mmm, my experimentation indicates that this does not happen. I've
> just spent 3 hours trying to prove / disprove this with various
> configurations - AFR self-heals on a file basis, not on a
> stripe-chunk basis.
> 
> If I have 4 bricks, two stripes using 2 bricks each, then an AFR on
> top - any sort of self-heal replicates the entire DB. If I have 4
> bricks, two AFRs and one stripe on top, I get the same thing.

I would expect AFR over stripe to replicate the whole file on 
inconsistent AFR versions, but I would have thought stripe over AFR 
would work, since the AFR should only be seeing chunks of files.  I 
don't see how the AFR could even be aware that the chunks belong to the 
same file, so how it would know to replicate all the chunks of a file is 
a bit of a mystery to me.  I will admit I haven't done much with the 
stripe translator though, so my understanding of its operation may be 
wrong.
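
For reference, the layout I had in mind is stripe over AFR, roughly like 
the client spec sketch below.  This is untested and from memory -- the 
hostnames, volume names and subvolume names are made up, and the exact 
block-size option syntax may differ between releases, so check it 
against the docs for your version:

  # four backend bricks exported by glusterfsd on two servers (names made up)
  volume brick1
    type protocol/client
    option transport-type tcp/client
    option remote-host server1
    option remote-subvolume brick-a
  end-volume

  volume brick2
    type protocol/client
    option transport-type tcp/client
    option remote-host server2
    option remote-subvolume brick-a
  end-volume

  volume brick3
    type protocol/client
    option transport-type tcp/client
    option remote-host server1
    option remote-subvolume brick-b
  end-volume

  volume brick4
    type protocol/client
    option transport-type tcp/client
    option remote-host server2
    option remote-subvolume brick-b
  end-volume

  # each AFR pair mirrors one set of stripe chunks across the two servers
  volume afr1
    type cluster/afr
    subvolumes brick1 brick2
  end-volume

  volume afr2
    type cluster/afr
    subvolumes brick3 brick4
  end-volume

  # stripe on top, so each AFR should only ever see chunks, never whole files
  volume stripe0
    type cluster/stripe
    option block-size 1MB   # option syntax from memory (some releases use a
                            # pattern like *:1MB) -- verify for your release
    subvolumes afr1 afr2
  end-volume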

>> I'm not familiar enough with database file writing practices in
>> general (not to mention your particular database's practices), or
>> the stripe translator to tell whether any of the following will
>> cause you problems, but they are worth looking into:
> 
> We're talking about flat files here, some with append, some with
> seek/write updates.

Eh, it's probably not a problem anyways because of the way filesystems 
do block management.

>> 1) Will the overhead the stripe translator introduces with a very
>> large file and relatively small chunks cause performance problems?
>> (5G in 1MB stripes = 5000 parts...)
> 
> No, this would be fine if the AFR/Stripe combination actually did a
> per-chunk self heal.

I was thinking the stripe translator might add some extra network 
overhead, but it probably only requests the stripes that hold the data 
you are asking for, so it is likely a non-issue (as you said).

>> 2) How will GlusterFS handle a write to a stripe that is currently
>> self-healing?  Block?
> 
> The stripe replicates the entire stripe (which is big) and both read
> and write operations block during the heal.

Do you mean that a change to a stripe replicates the entire file?

>> 3) Does the way the DB writes the DB file cause massive updates
>> throughout the file, or does it generally just append and update
>> the indices, or something completely different.  It could have an
>> effect on how well something like this works.
> 
> I don't think access speed is an issue, glusterfs is very quick. The
> issue is recovery, it appears not to operate as advertised!

Understood.  I'll have to actually try this when I have some time, 
instead of just doing some armchair theorizing.

>> Essentially, using this layout, you are keeping track of which
>> stripes have changed and only have to sync those particular ones on
>> self-heal. The longer the downtime, the longer self-heal will take,
>> but you can mitigate that problem with an rsync of the stripes
>> between the active and failed GlusterFS nodes BEFORE starting
>> glusterfsd on the failed node (make sure to get the extended
>> attributes too).
> 
> Ok, firstly, manual rsyncs sort of defeat the object of the
> exercise. Secondly, having to go through this process every time a
> configuration is changed / glusterfsd is restarted is unworkable. 
> Thirdly, replicating many GB's of data hammers the IO system and
> slows down the entire cluster - again undesirable.

Well, it depends on your goal.  I only suggested rsync for when a node 
has been offline for quite a while, which means a large number of stripe 
components would need to be updated, requiring a long sync time.  If it 
was a quick outage (glusterfs restart or system reboot), it wouldn't be 
needed.  Think of it as a jumpstart on the self-heal process without 
blocking.

This, of course, was assuming that the stripe over AFR setup works.
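
For what it's worth, a rough sketch of that pre-sync is below.  The 
hostnames, paths, and init script are made up, and note that rsync only 
carries extended attributes if your version is new enough to have 
-X/--xattrs (and trusted.* xattrs need root on both ends); otherwise 
you'd have to copy them separately with getfattr/setfattr:

  # Run on the recovering node BEFORE starting glusterfsd again:
  # -a preserves ownership/permissions/times, -X preserves extended
  # attributes such as trusted.afr.version (recent rsync, run as root).
  rsync -avX root@good-node:/export/stripe-1/ /export/stripe-1/

  # Once the backend directories match, start the daemon:
  /etc/init.d/glusterfsd start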

> Being able to restart a glusterfsd without breaking the replicas
> would help, but I see no mention of this ...

Because I'm not a dev, and have no control over this.  ;)  Yes, I would 
like this feature as well, although I can imagine a couple of snags that 
could make it problematic to implement.

>> The above setup, if feasible, would mitigate restart cost, to the
>> point where only a few megs might need to be synced on a glusterfs
>> restart.
> 
> Ok, well I appear to have both AFR and Striping working and I can
> observe their operation at brick level and confirm they are working
> Ok.
> 
> Here's my basic test harness;
> 
> On the client system;
> 
> $dd if=/dev/zero of=/mnt/stripe/database bs=1M count=1024
> 
> write.py:
>
> #!/usr/bin/python
> io = open("/mnt/stripe/database", "r+")
> io.seek(1024*1024*900)
> io.write("Change set version # 6\n")
> io.close()
> 
> On the bricks I have;
> 
> read.py:
>
> #!/usr/bin/python
> io = open("/export/stripe-1/database", "r+")
> io.seek(1024*1024*900)
> print io.readline()
> io.close()
> 
> When I run write.py on the client, both bricks show the correct
> change. Then I kill glusterfsd on brick2. Running write.py on the
> client shows an update on brick1, but obviously not on brick2.
> Restarting glusterfsd on brick2 shows a reconnect in the logs. On the
> client, "head -c1 database" initiates a self-heal, shown in the logs
> with DEBUG turned on. Running read.py on brick1 and brick2 then
> blocks ... an entire 1G chunk is copied to brick2, and read.py on
> bricks 1 and 2 continues when the copy finishes ..
> 
> (!)

Was this on AFR over stripe or stripe over AFR?

> I'm using fuse-2.7.2 from the repos and gluster 1.3.7 from the stable
> tgz ...
> 
> fyi; The fuse that comes with Ubuntu/Gutsy seems to cause gluster to
> crash under write-load, I'm still waiting to see if the current CVS
> version solves the problem ...

The GlusterFS-provided fuse is supposed to have better default values 
for certain variables (relating to transfer block size or some such) 
that optimize it for GlusterFS, and it's probably what the developers 
test against, so it's what I've been using.

-- 

-Kevan Benson
-A-1 Networks




