[Gluster-devel] Architecture advice

Gordan Bobic gordan at bobich.net
Mon Jan 12 20:00:44 UTC 2009


Martin Fick wrote:
> --- On Mon, 1/12/09, Gordan Bobic <gordan at bobich.net> wrote:
>>> ...
>>> No need for fencing simply because you now use HA
>>> translator. The assumption in this case is that the 
>>> servers can still talk to each other but that one 
>>> server's connection to the clients may have died.  
>> That means that 50% of the failure scenarios will still
>> wipe you out, because you'll start split-braining. Not the
>> way forward at all. A fencing setup will at least preserve
>> data integrity.
> 
> Fencing won't help either without cooperation, see below...
> 
> 
>> The correct way to handle comms channel
>> failure between client and server is to have bonded
>> interfaces going via different physical paths. _ONLY_
>> dealing with the situation where both servers are alive and
>> connected to each other but we can only reach one due to an
>> obscure failure somewhere in the network (e.g. a failed
>> switch port or a failed NIC in the server) is a pretty
>> half-arsed edge case.
> 
> Why is that the correct way?  There's nothing wrong with 
> having "bonding" at the glusterfs protocol level, is 
> there?

The problem is that it only covers a very narrow edge case that isn't 
all that likely anyway. A bonded pair of NICs, going via separate 
switches all the way to both servers, is a much more sensible option. 
Otherwise, what failure are you actually trying to protect against? 
It's a bit like fitting a big padlock to the door when there's a wall 
missing.
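
On the Linux side that is just ordinary NIC bonding, nothing 
GlusterFS-specific. A minimal sketch for a RHEL-style box (interface 
names and addresses are made up; eth0 and eth1 are assumed to be 
plugged into different switches):

  # /etc/modprobe.conf -- load the bonding driver in active-backup
  # mode, checking link state every 100ms
  alias bond0 bonding
  options bond0 mode=active-backup miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  # (ifcfg-eth1 is identical apart from DEVICE=eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

Do that on the clients and on both servers and a single dead switch 
port or NIC becomes a non-event, which is exactly the failure the HA 
translator is trying to paper over.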

> That is somewhat what the HA translator is, except 
> that it is supposed to take care of some additional 
> failures.  It is supposed to retransmit "in progress" 
> operations that have not succeeded because of comm 
> failures (I have yet to figure out where in the code 
> this happens though).

This is reinventing the wheel. NFS already handles this gracefully for 
the use case you are describing.
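A plain NFS hard mount, for instance, will simply keep retransmitting 
an outstanding operation until the server (or whatever takes over its 
IP) answers again, with no client-side changes required:

  # standard hard mount - retries indefinitely, but interruptible
  mount -t nfs -o hard,intr server:/export /mnt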

>> Why re-invent the wheel when the tools to deal with these
>> failure modes already exist?
> 
> Are you referring to bonding here? If so, see above 
> why HA may be better (or additional benefit).

My original point is that it doesn't add anything new that you couldn't 
achieve with tools that are already available.

>>> Any failures on the server side may still warrant a
>>> fencing setup, but AFR is not yet set up to work 
>>> cooperatively with a fencing setup.
>> It doesn't have to be. If one server in AFR dies
>> nothing spectacular happens. Things time out and carry on. I
>> don't see what cooperation there would need to be. RHCS
>> does its own heartbeating and fencing. Mix and match
>> as required.
> 
> Yes, if a server goes down you are fine (aside from the
> scenario where the other server then goes down followed
> by the first one coming back up).  But, if you are using
> the HA translator above and the communication goes down
> between the two servers you may still get split brain 
> (thus the need for heartbeat/fencing).

And therein lies the problem - unless you are proposing adding a 
complete fencing infrastructure into glusterfs, too.

> But, even with the current write logging in AFR, there 
> are possible split brain scenarios which can not be 
> avoided even with heartbeat/fencing (yet).  Anytime two 
> different clients try to write to the same area of the 
> filesystem and the network is segregated, there is a 
> chance that they each succeed and fail on opposite 
> servers causing split brain.  There is nothing heartbeat 
> can do about this except attempt to mitigate the problem 
> by intervening.  But heartbeat has no hooks to know when 
> this happens so by the time heartbeat intervenes, 
> "half writes" to each server may have occurred that 
> cannot be undone. That is the reason you really need 
> cooperation between AFR and some other tool (such as 
> heartbeat).

No, that's the whole point. You DON'T need that cooperation. If AFR is 
server-side and the servers get disconnected from each other, the 
cluster heartbeat will drop too, which will initiate fencing and 
failover, hard-power-off the failed server, and everything lives 
happily ever after. GlusterFS doesn't need to be aware of any of it. 
One node disappears; that's about the size of it. When it comes back, 
any files written in the meantime will have newer timestamps on the 
surviving node, so on the next read they'll get synced over to the 
re-joined node.
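
For clarity, by server-side AFR I mean something along these lines (a 
rough sketch only; the exact option names vary between GlusterFS 
releases, and the host names and paths here are made up):

  # server1.vol -- server2.vol is the mirror image, pointing back at
  # server1
  volume brick
    type storage/posix
    option directory /data/export
  end-volume

  volume brick-remote
    type protocol/client
    option transport-type tcp
    option remote-host server2
    option remote-subvolume brick
  end-volume

  volume home
    type cluster/afr
    subvolumes brick brick-remote
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.home.allow *
    subvolumes home
  end-volume

The heartbeat/fencing layer sits entirely outside of that - RHCS or 
heartbeat just watches the nodes and powers off whichever one drops out.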

> AFR needs to be able to write all or nothing 
> to all servers until some external policy machine 
> (such as heartbeat) decides that it is safe (because 
> of fencing or other mechanism) to proceed writing to 
> only a portion of the subvolumes (servers).  Without 
> this I don't see how you can prevent split brain?

With server-side AFR, split-brain cannot really occur. (OK, there's a 
tiny window of opportunity for it if a server isn't really totally 
dead, since there's no total FS lock-out until fencing completes like 
there is on GFS, but it's probably close enough.) If the servers can't 
heartbeat to each other, they can't AFR to each other either, so either 
a write gets propagated to both or it doesn't. The machine that 
remained operational will have the more up-to-date files, and those 
will get synced back as necessary. It isn't quite as tight on data 
consistency as a DRBD+GFS solution would be, but it is probably close 
enough for most use cases.
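
The client-side volume spec stays correspondingly dumb - something 
along the lines of (again just a sketch, names and addresses made up):

  # client.vol -- mount the replicated "home" volume; which physical
  # server answers is decided by the heartbeat-managed service IP, not
  # by glusterfs
  volume home
    type protocol/client
    option transport-type tcp
    option remote-host 192.168.1.100
    option remote-subvolume home
  end-volume

So the client never even has to know there are two servers, let alone 
arbitrate between them.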

Gordan




