[Gluster-devel] Choice of Translator question
Kevan Benson
kbenson at a-1networks.com
Wed Dec 26 22:00:22 UTC 2007
Gareth Bult wrote:
> Hi,
>
> Thanks for that, but I'm afraid I'd already read it ... :(
>
> The fundamental problem I have is with the method apparently employed
> by self-heal.
>
> Here's what I'm thinking;
>
> Take a 5G database sitting on an AFR with three copies. Normal
> operation - three consistent replicas, no problem.
>
> Issue # 1; glusterfsd crashes (or is crashed) on one node. That
> replica is immediately out of date as a result of continuous writes
> to the DB.
>
> Question # 1; When glusterfsd is restarted on the crashed node, how
> does the system know that node is out of date and should not be used
> for striped reads?
The trusted.afr.version extended attribute tracks which version of the
file is current. On a read, all participating AFR members respond with
this information, and any older/obsolete copies are replaced with a
newer copy from one of the up-to-date AFR members (this is self-heal).
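For reference, you can look at this versioning directly on the backend
exports with getfattr (a rough sketch; the export path below is just a
placeholder and the exact attribute names may differ between releases):

  # on each server, dump the AFR extended attributes on the backend copy
  getfattr -d -m trusted.afr -e hex /data/export/mydb/data.db

  # a node whose trusted.afr.version lags the others holds a stale copy
  # and will be overwritten by self-heal on the next open/read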
> My assumption; Because striped reads are per file and as a result,
> striping will not be applied to the database, hence there will be no
> read advantage obtained by putting the database on the filesystem ..
> ??
I think they are planning striped reads per block (maybe definable) at a
later date.
> Question # 2; Apart from closing the database and hence closing the
> file, how do we tell the crashed node that it needs to re-mirror the
> file?
Read the file from a client (e.g. head -c1 FILE >/dev/null to force an
open/read, which triggers self-heal).
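If you want to force this for everything on the mount rather than a
single file (say, after bringing a node back), a walk like the
following works (a sketch; /mnt/glusterfs is a placeholder mount point):

  # read the first byte of every file through the mount point, so each
  # file gets opened and self-healed if one of its copies is stale
  find /mnt/glusterfs -type f -exec head -c1 {} \; > /dev/null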
> Question # 3; Mirroring a 5G file will take "some time" and happens
> when you re-open the file. While mirroring, the file is effectively
> locked.
>
> Net effect;
>
> a. To recover from a crash the DB needs a restart.
> b. On restart, the DB is down for the time taken to copy 5G between
>    machines (over a minute).
>
> From an operational point of view, this doesn't fly .. am I missing
> something?
You could use the stripe translator over AFR so that chunks of the DB
file are AFR'd individually, allowing per-chunk self-heal (a rough
volume spec sketch follows the list below). I'm not familiar enough
with database file writing practices in general (let alone your
particular database's), or with the stripe translator, to tell whether
any of the following will cause you problems, but they are worth
looking into:
1) Will the overhead the stripe translator introduces with a very large
file and relatively small chunks cause performance problems? (5G in 1MB
stripes = over 5000 parts...)
2) How will GlusterFS handle a write to a stripe that is currently
self-healing? Block?
3) Does the way the DB writes its file cause updates scattered
throughout the file, or does it generally just append and update the
indices, or something completely different? This will have an effect on
how well a layout like this works.
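To illustrate, a client-side volume spec for that layout might look
something like the following (just a sketch in the 1.3-style spec
syntax; the brick names, block-size pattern and two-way striping are
placeholder choices, so check the option names against your installed
version):

  # each chunk of the striped file lives on one of these AFR sets,
  # replicated across three server bricks
  volume afr1
    type cluster/afr
    subvolumes server1-brick1 server2-brick1 server3-brick1
  end-volume

  volume afr2
    type cluster/afr
    subvolumes server1-brick2 server2-brick2 server3-brick2
  end-volume

  # stripe the DB file across the AFR sets in 1MB chunks
  volume stripe0
    type cluster/stripe
    option block-size *:1MB
    subvolumes afr1 afr2
  end-volume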
Essentially, with this layout you are keeping track of which stripes
have changed, and only those particular ones have to be synced on
self-heal. The longer the downtime, the longer self-heal will take, but
you can mitigate that by rsyncing the stripes from an active GlusterFS
node to the failed one BEFORE starting glusterfsd on the failed node
(make sure to copy the extended attributes too).
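For example, something along these lines (a sketch; the backend export
path and host name are placeholders, and preserving xattrs with -X
requires an rsync built with xattr support):

  # run from an up-to-date server, before glusterfsd is started on the
  # recovered node; -X copies the trusted.afr.* extended attributes
  rsync -avX /data/export/ failed-node:/data/export/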
> Also, it appears that I need to restart glusterfsd when I change the
> configuration files (i.e. to re-read them) which effectively crashes
> the node .. is there a way to re-read a config without crashing the
> node? (on the assumption that as above, crashing a node is
> effectively "very" expensive...?)
The above setup, if feasible, would mitigate the restart cost to the
point where only a few megs might need to be synced after a glusterfsd
restart.
--
-Kevan Benson
-A-1 Networks