[Gluster-devel] Choice of Translator question

Thu Dec 27 13:44:58 UTC 2007

>The trusted.afr.version extended attribute tracks which file version is 
>being used, and on a read, all participating AFR members should respond 
>with this information, and any older/obsoleted file versions are 
>replaced by a newer copy from one of the valid AFR members (this is 
>self-heal)

Yes, understood.
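
For illustration only, the version comparison described in the quote can be modelled as below. This is a toy sketch, not GlusterFS code: the function name plan_self_heal is hypothetical, and treating trusted.afr.version as a plain integer per brick is an assumption (the on-disk encoding is not shown in this thread).

```python
def plan_self_heal(versions):
    """Toy model: given {brick_name: version}, return (sources, stale).

    sources: bricks holding the highest version (the valid copies)
    stale:   bricks whose copy is older and would be replaced
    Assumption: versions are plain integers, higher means newer.
    """
    newest = max(versions.values())
    sources = sorted(b for b, v in versions.items() if v == newest)
    stale = sorted(b for b, v in versions.items() if v < newest)
    return sources, stale


if __name__ == "__main__":
    # brick2 missed one update while its glusterfsd was down
    print(plan_self_heal({"brick1": 6, "brick2": 5}))  # (['brick1'], ['brick2'])
```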

>I think they are planning striped reads per block (maybe definable) at a later date.

Mmm, so at the moment, when it says AFR does striped reads, what it really means is that it does striped reads only so long as you have lots of relatively small files, and not a few large ones .. ???

>Read from the file from a client (head -c1 FILE >/dev/null to force)

OR find /mountedfs -type f -exec head -c1 {} \; > /dev/null

.. which is good, but VERY inefficient for a large file-system.
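
The same walk can be sketched in Python, which makes it easier to add logging or throttling later. It is a minimal sketch with a hypothetical function name (touch_all_files); it still opens every file once, so it is no cheaper than the find command on a large file-system.

```python
import os


def touch_all_files(root):
    """Read one byte from every regular file under root.

    Returns the number of files that were read successfully.
    On a GlusterFS client mount, each read is what would prompt AFR
    to check (and if needed heal) that file.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets, broken symlinks
            try:
                with open(path, "rb") as fh:
                    fh.read(1)
                count += 1
            except (IOError, OSError):
                pass  # unreadable file: carry on with the rest
    return count


if __name__ == "__main__":
    print(touch_all_files("/mountedfs"))
```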

>you could use the stripe translator over AFR to AFR chunks of the DB 
>file, thus allowing per chunk self-heal.

Mmm, my experimentation indicates that this does not happen. I've just spent 3 hours trying to prove / disprove this with various configurations - AFR self-heals on a file basis, not on a stripe-chunk basis.

If I have 4 bricks, two stripes using 2 bricks each, then an AFR on top - any sort of self-heal replicates the entire DB.
If I have 4 bricks, two AFRs and one stripe on top, I get the same thing.

>I'm not familiar enough with database file writing practices in general (not to mention your 
>particular database's practices), or the stripe translator to tell 
>whether any of the following will cause you problems, but they are worth looking into:

We're talking about flat files here, some with append, some with seek/write updates.

>1) Will the overhead the stripe translator introduces with a very large file and relatively small chunks cause performance problems? (5G in 1MB stripes = 5000 parts...)

No, this would be fine if the AFR/Stripe combination actually did a per-chunk self heal.
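
To put numbers on the chunk count, here is a small arithmetic sketch. The helper names are hypothetical, and the round-robin placement (chunk i on subvolume i mod n) is an assumption about the stripe translator rather than something confirmed in this thread. With binary units, 5G in 1MB stripes comes to 5120 chunks rather than 5000.

```python
MiB = 1024 * 1024


def chunk_count(file_size, block_size):
    """Number of stripe chunks needed to hold file_size bytes."""
    return (file_size + block_size - 1) // block_size


def locate(offset, block_size, n_subvolumes):
    """Return (chunk_index, subvolume_index) for a byte offset.

    Assumption: chunks are dealt out round-robin across subvolumes.
    """
    index = offset // block_size
    return index, index % n_subvolumes


if __name__ == "__main__":
    print(chunk_count(5 * 1024 * MiB, MiB))  # 5120
    print(locate(900 * MiB, MiB, 2))         # (900, 0)
```

Under those assumptions, a write at the 900 MiB mark touches a single 1 MiB chunk, which is all a per-chunk self-heal would need to copy.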

>2) How will GlusterFS handle a write to a stripe that is currently self-healing?  Block?

The self-heal replicates the entire stripe (which is big), and both read and write operations block while it runs.

>3) Does the way the DB writes the DB file cause massive updates throughout the file, or does it generally just append and update the indices, or something completely different?  It could have an effect on how well something like this works.

I don't think access speed is an issue; glusterfs is very quick. The issue is recovery: it appears not to operate as advertised!

>Essentially, using this layout, you are keeping track of which stripes have changed and only have to sync those particular ones on self-heal. The longer the downtime, the longer self-heal will take, but you can mitigate that problem with an rsync of the stripes between the active and failed GlusterFS nodes BEFORE starting glusterfsd on the failed node (make sure to get the extended attributes too).

Ok, firstly, manual rsyncs sort of defeat the object of the exercise.
Secondly, having to go through this process every time a configuration is changed / glusterfsd is restarted is unworkable.
Thirdly, replicating many GBs of data hammers the IO system and slows down the entire cluster - again undesirable.

Being able to restart a glusterfsd without breaking the replicas would help, but I see no mention of this ...
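
As a rough way to see how little would need to move if only changed chunks were synced, the sketch below compares two copies of a file chunk by chunk. It is not what GlusterFS does; changed_chunks is a hypothetical helper, and it assumes both copies are readable from one host.

```python
def changed_chunks(path_a, path_b, chunk_size=1024 * 1024):
    """Return the indices of the chunks that differ between two files.

    A chunk present in only one file (length mismatch) counts as changed.
    """
    changed = []
    index = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                changed.append(index)
            index += 1
    return changed


# Usage (hypothetical paths, two copies of the same backend file):
#   changed_chunks("/export/copy-a/database", "/export/copy-b/database")
```

For the single-line write used in the test harness further down, a comparison like this would be expected to report one 1 MiB chunk out of 1024.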

>The above setup, if feasible, would mitigate restart cost, to the point where only a few megs might need to be synced on a glusterfs restart.

Ok, well I appear to have both AFR and Striping working and I can observe their operation at brick level and confirm they are working Ok.

Here's my basic test harness:

On the client system:

$ dd if=/dev/zero of=/mnt/stripe/database bs=1M count=1024

write.py
#!/usr/bin/python
# Overwrite a short marker 900 MiB into the 1 GiB test file, via the client mount.
io = open("/mnt/stripe/database", "r+")
io.seek(1024 * 1024 * 900)
io.write("Change set version # 6\n")
io.close()

On the bricks I have:

read.py
#!/usr/bin/python
# Print the marker straight from the brick's backend copy, at the same offset.
io = open("/export/stripe-1/database", "r+")
io.seek(1024 * 1024 * 900)
print(io.readline())
io.close()

When I run write.py on the client, both bricks show the correct change.
Then I kill glusterfsd on brick2.
Running write.py on the client shows an update on brick1, obviously not on brick2.
Restarting glusterfsd on brick2 shows a reconnect in the logs.
On the client, head -c1 database initiates a self-heal, shown in the logs with DEBUG turned on.
Running read.py on brick1 and brick2 blocks ...
An entire 1G chunk is copied to brick2.
read.py on brick1 and brick2 then continues when the copy finishes ..

(!)
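
To measure how long the read stalls during the heal, read.py could be extended to time itself. This is a minimal sketch; timed_readline is a hypothetical helper, and it opens the file read-only in binary mode, which differs slightly from read.py.

```python
import time


def timed_readline(path, offset):
    """Seek to offset, read one line, and return (line, seconds_taken)."""
    start = time.time()
    with open(path, "rb") as fh:
        fh.seek(offset)
        line = fh.readline()
    return line, time.time() - start


# Usage on a brick, with the path and offset from read.py:
#   line, secs = timed_readline("/export/stripe-1/database", 1024 * 1024 * 900)
#   print(line, secs)
```

A read that normally returns in milliseconds but takes as long as the 1G copy would confirm that the whole file is locked for the duration of the heal.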

I'm using fuse-2.7.2 from the repos and gluster 1.3.7 from the stable tgz ...

FYI: the fuse that comes with Ubuntu/Gutsy seems to cause gluster to crash under write load. I'm still waiting to see if the current CVS version solves the problem ...

Gareth.