[Gluster-devel] Architecture advice

Mon Jan 12 23:41:24 UTC 2009

Martin Fick wrote:
>>> Why is that the correct way?  There's nothing
>> wrong with having "bonding" at the glusterfs
>> protocol level, is there?
>>
>> The problem is that it only covers a very narrow edge case
>> that isn't all that likely. A bonded NIC over separate
>> switches all the way to both servers is a much more sensible
>> option. Or else what failure are you trying to protect
>> yourself against? It's a bit like fitting a big padlock
>> on the door when there's a wall missing.
> 
> I think you need to be more specific than using 
> analogies.  My only guess from your assertions is 
> that you have a very narrow specific use case /
> setup / terminology in mind that does not 
> necessarily mesh with my narrow use case ... :)

LOL! That is a distinct possibility. :)

> So, the HA translator supports talking to two
> different servers with two different transport
> mechanism and two different IPs.  Bonding does 
> not support anything like this as far as I can 
> tell.

True. Bonding is more transparent. You make two NICs into one virtual 
NIC and round-robin packets down them. If one NIC/path fails, all the 
traffic will fail over to the other NIC/path.
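
To make the behaviour concrete, here is a minimal sketch of round-robin
bonding with fail-over. It is only a toy model in Python, not how the
Linux bonding driver is implemented; the Bond class and the NIC names
eth0/eth1 are made up for illustration.

```python
from itertools import count


class Bond:
    """Toy model of two NICs presented as one virtual NIC (hypothetical)."""

    def __init__(self, nics):
        self.nics = list(nics)
        self.alive = set(self.nics)   # links currently believed to be up
        self._seq = count()           # running packet counter

    def send(self, packet):
        """Return the NIC the packet would leave on."""
        live = [nic for nic in self.nics if nic in self.alive]
        if not live:
            raise ConnectionError("all links down")
        # Round-robin over whatever links are still alive
        return live[next(self._seq) % len(live)]


bond = Bond(["eth0", "eth1"])
```

With both links up, successive packets alternate between eth0 and eth1.
Once eth0 is removed from bond.alive, every packet goes down eth1.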

> So, it seems like you are assuming a
> different back end use case, one where the 
> servers employ the same IP perhaps using round 
> robin or perhaps in an active passive way.

No, not at all. Multiple servers, 1 floating IP per server. Floating as 
in it can be migrated to the other server if one fails. You balance the 
load by assigning half of your clients to one floating IP, and the other 
half of the clients to the other floating IP. So, when both servers are 
up, each handles half the load. If one server fails, its IP gets 
migrated to the other server, and all clients thereafter talk to the 
surviving server since it has both IPs (until the other server comes 
back up and asks for its IP address back).
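
As a rough illustration of that scheme, the sketch below models two
servers with one floating IP each. It is a simulation only: the
FloatingIPCluster class, the server names and the 10.0.0.x addresses
are all hypothetical, and a real setup would use something like
heartbeat to move the addresses.

```python
class FloatingIPCluster:
    """Toy model: each floating IP has a home server and a current owner."""

    def __init__(self, home):
        self.home = dict(home)        # floating IP -> server that normally owns it
        self.owner = dict(home)       # floating IP -> server holding it right now
        self.up = set(home.values())  # servers currently alive

    def fail(self, server):
        """Mark a server as down and migrate its IPs to a survivor."""
        self.up.discard(server)
        if not self.up:
            raise RuntimeError("no surviving server to take over")
        survivor = sorted(self.up)[0]
        for ip, holder in self.owner.items():
            if holder == server:
                self.owner[ip] = survivor

    def recover(self, server):
        """Bring a server back; it asks for its own IPs back."""
        self.up.add(server)
        for ip, home_server in self.home.items():
            if home_server == server:
                self.owner[ip] = server

    def route(self, ip):
        """Return the server currently answering on a floating IP."""
        return self.owner[ip]


cluster = FloatingIPCluster({"10.0.0.1": "serverA", "10.0.0.2": "serverB"})
# Half of the clients are pointed at each floating IP
clients = {"client1": "10.0.0.1", "client2": "10.0.0.2"}
```

After cluster.fail("serverA"), both addresses route to serverB, so
client1 keeps using 10.0.0.1 without any change on its side. After
cluster.recover("serverA"), 10.0.0.1 moves back.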

> Both
> of these are very different beasts and I would
> need to know which you are talking about to
> understand what you are getting at.  But the HA
> translator setup is closer to the round robin
> (active/active) setup and I am guessing you 
> are talking about an active / passive setup.

In general, there are relatively few things that you cannot make 
active/active, so I always mean active/active + failover unless I 
explicitly state otherwise.

>>> That is somewhat what the HA translator is, except
>> that it is supposed to take care of some additional
>> failures.  It is supposed to retransmit "in
>> progress" operations that have not succeeded because of
>> comm failures (I have yet to figure out where in the code
>> this happens though).
>>
>> This is a reinvention of a wheel. NFS already handles this
>> gracefully for the use-case you are describing.
> 
> I am lost, what does NFS have to do with it?

It already handles the "server has gone away" situation gracefully. What 
I'm saying is that you can use GlusterFS underneath for mirroring the 
data (AFR) and re-export with NFS to the clients. If you want to avoid 
client-side AFR and still have graceful failover with lightweight 
transport, NFS is not a bad choice.

>>>> Why re-invent the wheel when the tools to deal
>> with these
>>>> failure modes already exist?
>>> Are you referring to bonding here? If so, see above
>> why HA may be better (or additional benefit).
>>
>> My original point is that it doesn't add anything new
>> that you couldn't achieve with tools that are already
>> available.
> 
> 
> Well, I was trying to explain to you that it
> does, but then the NFS thing, I am confused.
> 
> How do current tools achieve the following
> setup?  Client A talks to Server A and 
> submits a read request.  The read request 
> is received on Server A (TCP acked to the 
> client), and then Server A dies.  How will
> the following request be completed without
> glusterfs returning an "endpoint not 
> connected" error?

You make client <-> server comms NFS.
You make server <-> server comms GlusterFS.

If the NFS server goes away, the client will keep retrying until the 
server returns. In this case, that would mean it'll keep retrying until 
the other server fails the IP address over to itself.

This achieves:
1) server side AFR with GlusterFS for redundancy
2) client connects to a single server via NFS, so the client doesn't 
use double the bandwidth writing to both mirrors itself
3) servers can fail over relatively transparently to the client
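
The client-side behaviour this relies on (keep retrying the same
address until something answers) can be sketched roughly as below.
This is not NFS code; blocking_request and make_flaky_server are
hypothetical helpers, and the flaky server simply stands in for the
window during which the floating IP has not yet been taken over.

```python
import time


def blocking_request(send, retry_delay=0.01, sleep=time.sleep):
    """Retry send() until it succeeds instead of returning an error.

    send is a hypothetical callable that raises ConnectionError while
    no server is answering on the address the client talks to.
    Returns the reply and the number of attempts it took.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return send(), attempts
        except ConnectionError:
            sleep(retry_delay)  # wait, then try the same address again


def make_flaky_server(failures):
    """Build a send() that fails `failures` times, then answers."""
    state = {"left": failures}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("server not reachable yet")
        return "data"

    return send
```

The caller never sees the failure, only a delay: the request that was
in flight when the server died is simply re-sent once the surviving
server holds the address.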

> No, I have not confirmed that this actually
> works with the HA translator, but I was told
> that the following would happen if it were 
> used.  Client A talks to Server A and 
> submits a read request.  The read request 
> is received on Server A (TCP acked to the 
> client), and then Server A dies.  Client A
> will then in theory retry the read request
> on Server B.  Bonding cannot do anything
> like this (since the read was tcp ACKed)?  

Agreed, if a server fails, bonding won't help. Cluster fail-over 
server-side, however, will, provided the network file system protocol 
can deal with it reasonably well.

> Neither can heartbeat/failover
> of an active/passive backend since on the
> first failure the client will get a 
> connection error and the glusterfs client
> protocol does not retransmit.

This is where I clearly failed to clarify what I meant. I was talking 
about using NFS for the client<->server part of the communication. NFS 
will typically block until the server starts responding again (note: it 
doesn't have to be the same server, just one like it).

> I think that this is quite different from
> any bonding solution.  Not better, different.
> If I were to use this it would not preclude 
> me from also using bonding, but it solves a 
> somewhat different problem.  It is not a 
> complete solution, it is a piece, but not
> a duplicated piece.  If you don't like it,
> or it doesn't fit your backend use case, 
> don't use it! :)

If it can handle the described failure more gracefully than what I'm 
proposing, then I'm all for it. I'm just not sure there is that much 
scope for it being better since the last write may not have made it to 
the mirror server anyway, so even if the protocol can re-try, it would 
need to have some kind of journaling, roll back the journal and replay 
the operation.

This, however, is a much more complex approach (very similar to what GFS 
does), and there is a high price to pay in terms of performance when the 
nodes aren't on the same LAN.

>>> Yes, if a server goes down you are fine (aside from the
>>> scenario where the other server then goes down followed
>>> by the first one coming back up).  But, if you are using
>>> the HA translator above and the communication goes down
>>> between the two servers you may still get split brain
>>> (thus the need for heartbeat/fencing).
>> And therein lies the problem - unless you are proposing
>> adding a complete fencing infrastructure into glusterfs,
>> too.
> 
> No. I am proposing adding a complete transactional 
> model to AFR so that if a write fails on one node, 
> some policy can decide whether the same write 
> should be committed or rolled back on the other 
> nodes.  Today, the policy is to simply apply it to 
> the other nodes regardless.  This is a recipe for 
> split brain.  

OK, I get what you mean. It's basically the same problem I described 
above when I mentioned that you'd need some kind of a journal to 
roll-back the operation that hasn't been fully committed.
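
A very rough sketch of that journal-and-roll-back idea follows. It is
an in-memory toy, not anything AFR does today; Replica and
replicated_write are hypothetical names, and a real implementation
would need a persistent journal that survives a crash.

```python
_MISSING = object()  # marks "key did not exist before the write"


class Replica:
    """Toy storage node; healthy=False simulates an unreachable server."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def write(self, key, value):
        """Apply the write and return the value it replaced."""
        if not self.healthy:
            raise OSError(f"{self.name} unreachable")
        previous = self.data.get(key, _MISSING)
        self.data[key] = value
        return previous


def replicated_write(replicas, key, value):
    """Write to every replica, or undo the write everywhere on failure."""
    journal = []  # (replica, previous value) for each write applied so far
    try:
        for replica in replicas:
            journal.append((replica, replica.write(key, value)))
    except OSError:
        # Roll back the writes that already landed, newest first
        for replica, previous in reversed(journal):
            if previous is _MISSING:
                replica.data.pop(key, None)
            else:
                replica.data[key] = previous
        return False
    return True
```

The policy here is hard-wired to "roll back if any node fails"; the
point of the proposal is that this decision could be made pluggable.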

> In the case of network segregation some policy 
> should decide to allow writes to be applied
> to one side of the segregation and denied on the 
> other.  This does not require fencing (but it
> would be better with it), it could be a simple 
> policy like: "apply writes if a majority of nodes 
> can be reached", if not fail (or block would be
> even better).

Hmm... This could lead to an elastic shifting quorum. I'm not sure how 
you'd handle resyncing if nodes are constantly leaving/joining. It seems 
a bit non-deterministic.
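
For reference, the "majority of nodes" policy quoted above amounts to
something like the sketch below. The function names are hypothetical,
and it assumes a fixed, known list of nodes and that the deciding node
counts itself as reachable.

```python
def has_quorum(reachable, total):
    """True when strictly more than half of all nodes are reachable."""
    return reachable * 2 > total


def decide_write(reachable_nodes, all_nodes):
    """Return "apply" if this side of a partition may write, else "deny"."""
    members = set(all_nodes)
    reachable = len(set(reachable_nodes) & members)
    return "apply" if has_quorum(reachable, len(members)) else "deny"
```

With three nodes split 2/1, the side with two nodes applies writes and
the lone node denies them. With only two servers split 1/1, neither
side has a strict majority, so both deny. The fixed membership list is
exactly what is not guaranteed if nodes keep leaving and joining.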

>>> AFR needs to be able write all or nothing to all
>>> servers until some external policy machine (such as
>>> heartbeat) decides that it is safe (because of fencing or
>>> other mechanism) to proceed writing to only a portion of the
>>> subvolumes (servers).  Without this I don't see how you
>>> can prevent split brain?
>> With server-side AFR, splitbrain cannot really occur (OK,
>> there's a tiny window of opportunity for it if the
>> server isn't really totally dead since there's no
>> total FS lock-out until fencing is completed like on GFS,
>> but it's probably close enough). If the servers
>> can't heartbeat to each other, they can't AFR to
>> each other, either. So either the write gets propagated, or
>> it doesn't. The machine that remained operational will
>> have more up to date files and as necessary those will get
>> synced back. It's not quite as tight as GFS in terms of
>> ensuring data consistency like a DRBD+GFS solution would be,
>> but it is probably close enough for most use-cases.
> 
> 
> I guess what you call tiny, I call huge.  Even if 
> you have your heartbeat fencing occur in under a
> tenth of a second, that is time enough to split 
> brain a major portion of a filesystem.  I would 
> never trust it.

In GlusterFS that problem exists anyway, but it is largely mitigated by 
the fact that it works on file level rather than block device level. In 
the case of GFS, RHCS will block all access to the file system until the 
node is successfully fenced and confirmed fenced before rolling back 
its journals and resuming operation.

> To borrow your analogy, adding heartbeat to the 
> current AFR:  "It's a bit like fitting a big 
> padlock on the door when there's a wall missing."
> :)  
> 
> Every single write needs to ensure that it will 
> not cause split brain for me to trust it.

Sounds like GlusterFS isn't necessarily the solution for you, then. :(

> If not, why would I bother with glusterfs over
> AFR instead of glusterfs over DRBD?  Oh right, 
> because I cannot get glusterfs to failover without
> incurring connection errors on the client! ;)
> (not your beef, I know, from another thread)

Precisely - which is why I originally suggested not using GlusterFS for 
client-server communication. :)

> This is one reason I was hoping that the HA
> translator would address this, but the HA
> translator is useless in an active/passive
> backend setup, it only works in active/active.
> If you try using it in an active/passive setup,
> during failover it will retry too quickly on
> the second server causing connection errors
> on the client!!!  This is the primary reason
> that I am suggesting that the HA translator
> block until the connection is restored, it
> would allow for failovers to occur.

And this is exactly why I suggested using NFS for the client<->server 
connection. NFS blocks until the server becomes contactable again.

> But, to be clear, I am not disagreeing with you
> that the HA translator does not solve the split
> brain problem at all.  Perhaps this is what is 
> really "upsetting" you, not that it is
> "duplicated" functionality, but rather that it 
> does not help AFR solve its split brain 
> personality disorders, it only helps make them 
> more available, thus making split brain even 
> more likely!! ;(

I'm not sure it makes it any worse WRT split-brain, it just seems that 
you are looking for GlusterFS+HA to provide you with exactly the same 
set of features that NFS+(server fail-over) already provides. Of course, 
there could be advantages in GlusterFS behaving the same way as NFS when 
the server goes away if it's a single-server setup - it would be easier 
to set up and a bit more elegant. But it wouldn't add any functionality 
that couldn't be re-created using the sort of setup I described.

Gordan