[Gluster-devel] Choice of Translator question

Kevan Benson kbenson at a-1networks.com
Wed Dec 26 22:00:22 UTC 2007


Gareth Bult wrote:
> Hi,
> 
> Thanks for that, but I'm afraid I'd already read it ... :(
> 
> The fundamental problem I have is with the method apparently employed
> by self-heal.
> 
> Here's what I'm thinking;
> 
> Take a 5G database sitting on an AFR with three copies. Normal
> operation - three consistent replicas, no problem.
> 
> Issue # 1; glusterfsd crashes (or is crashed) on one node. That
> replica is immediately out of date as a result of continuous writes
> to the DB.
> 
> Question # 1; When glusterfsd is restarted on the crashed node, how
> does the system know that node is out of date and should not be used
> for striped reads?

The trusted.afr.version extended attribute tracks which file version is 
being held, and on a read, all participating AFR members respond with 
this information; any older/obsoleted file versions are then replaced 
by a newer copy from one of the valid AFR members (this is self-heal).
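
If you want to see that for yourself, the attribute can be inspected 
directly on each server's backend export. A sketch only - the paths are 
examples and the exact attribute names can vary between GlusterFS 
versions:

  # Dump the trusted.afr.* attributes that self-heal compares; a brick
  # whose version lags the others holds the stale copy and will be
  # overwritten on the next open/read:
  getfattr -d -m trusted.afr -e hex /export/brick/db/data.db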

> My assumption; Because striped reads are per file and as a result,
> striping will not be applied to the database, hence there will be no
> read advantage obtained by putting the database on the filesystem ..
> ??

I think striped reads per block (with a possibly configurable block 
size) are planned for a later date.

> Question # 2; Apart from closing the database and hence closing the
> file, how do we tell the crashed node that it needs to re-mirror the
> file?

Read the file from a client (e.g. head -c1 FILE >/dev/null to force it).
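
To force that for every file under a directory in one go, something 
like this should work from a client mount (the mount path here is just 
an example):

  # Read the first byte of every file through the GlusterFS mount so
  # AFR's open/read path notices stale copies and triggers self-heal:
  find /mnt/glusterfs -type f -exec head -c1 '{}' \; > /dev/null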

> Question # 3; Mirroring a 5G file will take "some time" and happens
> when you re-open the file. While mirroring, the file is effectively
> locked.
> 
> Net effect;
> 
> a. To recover from a crash the DB needs a restart
> b. On restart, the DB is down for the time taken to copy 5G between
> machines (over a minute)
> 
> From an operational point of view, this doesn't fly .. am I missing
> something?

You could use the stripe translator over AFR to AFR chunks of the DB 
file, thus allowing per-chunk self-heal.  I'm not familiar enough with 
database file writing practices in general (not to mention your 
particular database's practices), or with the stripe translator, to tell 
whether any of the following will cause you problems, but they are worth 
looking into (a rough spec sketch follows the list):

1) Will the overhead the stripe translator introduces with a very large 
file and relatively small chunks cause performance problems? (5G in 1MB 
stripes = 5000 parts...)
2) How will GlusterFS handle a write to a stripe that is currently 
self-healing?  Block?
3) Does the way the DB writes the DB file cause massive updates 
throughout the file, does it generally just append and update the 
indices, or is it something completely different?  It could have an 
effect on how well something like this works.
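
For illustration only, the client-side spec for that layout might look 
something like the following - the volume names, hosts and option 
syntax are from memory and may not match your GlusterFS release, so 
treat it as a sketch rather than a working config:

  # The protocol/client volumes (remote1a, remote1b, ... pointing at
  # each server's exported brick) are omitted here for brevity.

  volume afr1
    type cluster/afr
    subvolumes remote1a remote1b remote1c
  end-volume

  volume afr2
    type cluster/afr
    subvolumes remote2a remote2b remote2c
  end-volume

  # Stripe the DB file across the AFR sets in (for example) 1MB blocks,
  # so self-heal only has to copy the stripes that actually changed:
  volume stripe0
    type cluster/stripe
    option block-size *:1MB
    subvolumes afr1 afr2
  end-volume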

Essentially, using this layout, you are keeping track of which stripes 
have changed and only have to sync those particular ones on self-heal. 
The longer the downtime, the longer self-heal will take, but you can 
mitigate that problem with an rsync of the stripes between the active 
and failed GlusterFS nodes BEFORE starting glusterfsd on the failed node 
(make sure to get the extended attributes too).
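
For example (hypothetical paths; run it as root so the trusted.* 
attributes are readable, and note that copying xattrs needs an rsync 
built with that support, e.g. the -X/--xattrs option):

  # Pre-seed the failed node's backend export from a healthy node,
  # preserving extended attributes, before starting glusterfsd on it:
  rsync -avX goodnode:/export/brick/ /export/brick/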

> Also, it appears that I need to restart glusterfsd when I change the
> configuration files (i.e. to re-read them) which effectively crashes
> the node .. is there a way to re-read a config without crashing the
> node? (on the assumption that as above, crashing a node is
> effectively "very" expensive...?)

The above setup, if feasible, would mitigate the restart cost to the 
point where only a few megs might need to be synced on a glusterfs restart.

-- 

-Kevan Benson
-A-1 Networks




