[Gluster-devel] Architecture advice

Tue Jan 13 01:15:56 UTC 2009

--- On Mon, 1/12/09, Gordan Bobic <gordan at bobich.net> wrote:

Ding, ding, ding ding!!!  I get it, you are 
using NFS to achieve blocking, exactly my #1
remaining gripe with glusterfs: it does not 
block!  Please try explaining why this is
important to you to the glusterfs devs! I
am not sure that I made my case clear to 
them.  It seems like your use of NFS is 
primarily based upon this (what I perceive
to be) major remaining shortcoming of 
glusterfs.  Would you give up NFS if 
blocking were implemented in glusterfs?

One remaining drawback to NFS, which you 
may not care about, is the fact that NFS
servers should not themselves be NFS 
clients.  My desired operational scenario 
is a more "peer 2 peer" scenario in which
I would need my servers to be able to mount
their own subvolumes.

> > So, the HA translator supports talking to two
> > different servers with two different transport
> > mechanism and two different IPs.  Bonding does not
> > support anything like this as far as I can tell.
> 
> True. Bonding is more transparent. You make two NICs into
> one virtual NIC and round-robin packets down them. If one
> NIC/path fails, all the traffic will fail over to the other
> NIC/path.

Another benefit of the HA translator is that 
you can have two entirely different paths, which 
is very hard to do with bonding.  With bonding 
you are restricted to one IP.  If you think 
about using a WAN, this would not allow you to 
access a remote server using two entirely 
different IPs which use two entirely different 
WAN GWs.  The HA translator should in theory 
make this very easy.

> No, not at all. Multiple servers, 1 floating IP per server.
> Floating as in it can be migrated to the other server if one
> fails. You balance the load by assigning half of your
> clients to one floating IP, and the other half of the
> clients to the other floating IP. So, when both servers are
> up, each handles half the load. If one server fails,
> its IP gets migrated to the other server, and all
> clients thereafter talk to the surviving server since it has
> both IPs (until the other server comes back up and asks for
> its IP address back).

Got it.
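
To check that I have the floating IP scheme right, here is a rough 
Python sketch of it.  It is only my own illustration: the IPs, server 
names and function names are made up, and it models the bookkeeping 
only, not heartbeat or any real cluster tool.

```python
# Hypothetical model of "1 floating IP per server" failover.
# Nothing here talks to a real network or cluster manager.

def assign_clients(clients, floating_ips):
    """Spread clients across the floating IPs round-robin."""
    return {c: floating_ips[i % len(floating_ips)]
            for i, c in enumerate(clients)}


def owners_after_failure(home, alive):
    """Map each floating IP to the server that currently holds it.

    home:  floating IP -> server that normally owns it
    alive: set of servers that are currently up
    """
    survivors = sorted(alive)
    if not survivors:
        raise RuntimeError("no server left to hold the floating IPs")
    owners = {}
    for ip, server in home.items():
        # The IP stays home while its server is up, else it migrates.
        owners[ip] = server if server in alive else survivors[0]
    return owners


home = {"10.0.0.1": "serverA", "10.0.0.2": "serverB"}
clients = ["c1", "c2", "c3", "c4"]

# Half of the clients mount via each floating IP.
mapping = assign_clients(clients, sorted(home))

# serverA dies: serverB ends up holding both IPs.
failed_over = owners_after_failure(home, {"serverB"})

# serverA returns and takes its IP back.
recovered = owners_after_failure(home, {"serverA", "serverB"})
```

The clients never change the IP they talk to; only the owner of that 
IP changes underneath them.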

> >> This is a reinvention of a wheel. NFS already
> >> handles this gracefully for the use-case you 
> >> are describing.
> > 
> > I am lost, what does NFS have to do with it?
> 
> It already handles the "server has gone away"
> situation gracefully. What I'm saying is that you can
> use GlusterFS underneath for mirroring the data (AFR) and
> re-export with NFS to the clients. If you want to avoid
> client-side AFR and still have graceful failover with
> lightweight transport, NFS is not a bad choice.

Uh, not exactly a good choice though, it seems like
an awfully big hammer to use just because you think
it's better than reinventing the wheel.  I can see
that it will work in your strict client/server use
case, but not in "peer 2 peer".  A simple HA 
translator would be a much better, more flexible 
solution, and better integrated with glusterfs, 
don't you think?

> > How do current tools achieve the following
> > setup?  Client A talks to Server A and submits a read
> > request.  The read request is received on Server A (TCP
> > acked to the client), and then Server A dies.  How will
> > the following request be completed without
> > glusterfs returning an "endpoint not
> > connected" error?
> 
> You make client <-> server comms NFS.
> You make server <-> server comms GlusterFS.
> 
> If the NFS server goes away, the client will keep retrying
> until the server returns. In this case, that would mean
> it'll keep retrying until the other server fails the IP
> address over to itself.
> 
> This achieves:
> 1) server side AFR with GlusterFS for redundancy
> 2) client connects to a single server via NFS so
> there's no double-bandwidth used by the client
> 3) servers can fail over relatively transparently to the
> client

Makes sense. 
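
The blocking behaviour I am after looks roughly like the sketch below. 
This is only an illustration of "keep retrying until the server 
returns"; ServerDown, blocking_call and make_flaky_server are names I 
made up, not NFS or glusterfs code.

```python
import time


class ServerDown(Exception):
    """Stands in for an 'endpoint not connected' style failure."""


def blocking_call(request, send, retry_delay=0.0, sleep=time.sleep):
    """Retry `send` until it succeeds instead of returning an error.

    Returns (result, number_of_attempts).  The caller simply blocks
    for as long as the server (or its floating IP) is unreachable.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return send(request), attempts
        except ServerDown:
            sleep(retry_delay)


def make_flaky_server(failures):
    """Build a fake server that is down for the first `failures` calls."""
    state = {"left": failures}

    def send(request):
        if state["left"] > 0:
            state["left"] -= 1
            raise ServerDown("endpoint not connected")
        return "ok:" + request

    return send


# The server is unreachable three times (IP failing over), then answers.
result, attempts = blocking_call("read foo", make_flaky_server(3))
```

The application above never sees the three failures, it only sees a 
slow read.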

> > No, I have not confirmed that this actually
> > works with the HA translator, but I was told
> > that the following would happen if it were used. 
> > Client A talks to Server A and submits a read request.  The
> > read request is received on Server A (TCP acked to the
> > client), and then Server A dies.  Client A
> > will then in theory retry the read request
> > on Server B.  Bonding cannot do anything
> > like this (since the read was tcp ACKed)?  
> 
> Agreed, if a server fails, bonding won't help. Cluster
> fail-over server-side, however, will, provided the network
> file system protocol can deal with it reasonably well.

Yes, but I fear you might still have a corner case 
where you can get some non-posix behavior with this
setup, just as I mentioned that I believe you would 
with the HA translator.  

 -Client 1 writes a seq # (1) to file foo via server A
 -Server A processes writes to file foo on both
   server A and B and dies without acking the
   write to client 1
 -Client 2 reads the seq#(1) from file foo via
   server B.  
 -Client 2 increments seq # to 2 and writes it to
   file foo via server B.
 -Client 1 retries its original write of 1 to file
   foo via server B which it believes failed via 
   server A and succeeds.

->> Beware: file foo now contains 1, yet client 2 clearly 
read it as 1 and successfully incremented it to 2.  Tricky,
but evil!
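
The sequence above can be replayed step by step in a few lines of 
Python.  This is a toy model I wrote for illustration only: Store and 
mirrored_write are made-up names, and the "servers" are just 
dictionaries, not glusterfs or NFS.

```python
class Store:
    """A toy server: a dictionary of file name -> contents."""

    def __init__(self):
        self.files = {}

    def write(self, name, value):
        self.files[name] = value

    def read(self, name):
        return self.files[name]


server_a = Store()
server_b = Store()


def mirrored_write(name, value):
    """Server-side mirroring: the write lands on both servers."""
    server_a.write(name, value)
    server_b.write(name, value)


# Client 1 writes seq # 1 via server A.  Server A mirrors the write to
# both servers, then dies before acking, so client 1 thinks it failed.
mirrored_write("foo", 1)

# Client 2 reads the seq # from server B and sees 1.
seen_by_client2 = server_b.read("foo")

# Client 2 increments it and writes 2 via server B.
server_b.write("foo", seen_by_client2 + 1)
after_client2 = server_b.read("foo")

# Client 1 retries its "failed" write of 1 via server B and succeeds.
server_b.write("foo", 1)
final = server_b.read("foo")
```

At the end final is 1 even though client 2 successfully wrote 2, which 
is exactly the lost update I am worried about.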

> > I think that this is quite different from
> > any bonding solution.  Not better, different.
> > If I were to use this it would not preclude me from
> > also using bonding, but it solves a somewhat different
> > problem.  It is not a complete solution, it is a piece, but
> > not a duplicated piece.  If you don't like it,
> > or it doesn't fit your backend use case, don't
> > use it! :)
> 
> If it can handle the described failure more gracefully than
> what I'm proposing, then I'm all for it. I'm
> just not sure there is that much scope for it being better
> since the last write may not have made it to the mirror
> server anyway, so even if the protocol can re-try, it would
> need to have some kind of journaling, roll back the journal
> and replay the operation.

That's why I said "in theory" about the HA translator! :) 

I do not see anything in the code that actually keeps
track of requests until they are replied to, but I
was told that it can replay it.  Can someone explain
where this is done?

I can't see how this is done without some type of RAM 
journal.  I say RAM because requests need not 
survive a client crash; they simply need to hit the 
server disk before the client returns a success.  
If the client crashes, the apps never got a confirmation, 
so the requests will not need to be replayed.

Why do you think a client would need to be able
to roll back the journal?  It should just have to 
replay it, no rollback.
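
What I mean by a RAM journal is something like the sketch below. 
It is purely my own illustration, I have not found anything like it 
in the HA translator code, and ReplayJournal and its methods are 
hypothetical names.

```python
class ReplayJournal:
    """In-memory record of requests that have not been acknowledged."""

    def __init__(self):
        self._pending = {}   # request id -> request payload
        self._next_id = 0

    def record(self, request):
        """Remember a request before it is sent; returns its id."""
        rid = self._next_id
        self._next_id += 1
        self._pending[rid] = request
        return rid

    def acknowledge(self, rid):
        """Forget a request once the server has confirmed it."""
        self._pending.pop(rid, None)

    def pending(self):
        """Unacknowledged requests, oldest first."""
        return [self._pending[rid] for rid in sorted(self._pending)]

    def replay(self, send):
        """Resend every unacknowledged request through `send`, in order."""
        for rid in sorted(self._pending):
            send(self._pending[rid])
            self.acknowledge(rid)


journal = ReplayJournal()
first = journal.record("write foo 1")
second = journal.record("write bar 2")

# Server A confirms the first request, then dies.
journal.acknowledge(first)

# Fail over: resend whatever was never confirmed to server B.
resent = []
journal.replay(resent.append)
```

Nothing is ever rolled back here, and the journal dies with the 
client, which is fine since the application never got a success for 
those requests.  Blind replay is only safe when the server can tell 
that a request was already applied, as the seq # example shows.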

> This, however, is a much more complex approach (very
> similar to what GFS does), and there is a high price to pay
> in terms of performance when the nodes aren't on the
> same LAN.

With glusterfs's architecture it should not be much of 
a price, just the buffering of requests until they are 
completed.

> > No. I am proposing adding a complete transactional
> > model to AFR so that if a write fails on one node, some
> > policy can decide whether the same write should be committed
> > or rolled back on the other nodes.  Today, the policy is to
> > simply apply it to the other nodes regardless.  This is a
> > recipe for split brain.
> 
> OK, I get what you mean. It's basically the same
> problem I described above when I mentioned that you'd
> need some kind of a journal to roll-back the operation that
> hasn't been fully committed.

I don't see it at all like above, since above you do
not need to roll back.  In this case, depending on 
which side of the segregated network you are on, the
journal may need to be rolled back or committed.

> > In the case of network segregation some policy should
> > decide to allow writes to be applied
> > to one side of the segregation and denied on the
> > other.  This does not require fencing (but it
> > would be better with it), it could be a simple policy
> > like: "apply writes if a majority of nodes can be
> > reached", if not fail (or block would be
> > even better).
> 
> Hmm... This could lead to an elastic shifting quorum.
> I'm not sure how you'd handle resyncing if nodes are
> constantly leaving/joining. It seems a bit
> non-deterministic.

I wasn't trying to focus on a specific policy, but
I fail to see any actual problem as long as you always
have a majority?  Could you be specific about a 
problematic case?

I would suggest other policies also, thus my request
for an external hook.
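
As a sketch of the kind of hook I have in mind (my own illustration 
only: majority_policy and try_write are hypothetical names, and AFR 
has no such interface today as far as I know):

```python
def majority_policy(reachable, total):
    """Allow writes only when a strict majority of nodes is reachable."""
    return 2 * reachable > total


def try_write(reachable_nodes, all_nodes, policy=majority_policy):
    """Ask the pluggable policy whether this side may apply the write."""
    if policy(len(reachable_nodes), len(all_nodes)):
        return "applied"
    return "denied"


nodes = ["n1", "n2", "n3", "n4", "n5"]

# The network splits into a side of three and a side of two.
big_side = try_write(["n1", "n2", "n3"], nodes)
small_side = try_write(["n4", "n5"], nodes)

# An even split of four nodes: neither side has a majority.
even_split = try_write(["n1", "n2"], ["n1", "n2", "n3", "n4"])
```

With a strict majority at most one side of any split can ever write, 
so the two sides cannot diverge.  Any other policy could be passed in 
through the policy argument, and "denied" could just as well be 
"block until a majority is back".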

> > I guess what you call tiny, I call huge.  Even if you
> > have your heartbeat fencing occur in under a
> > tenth of a second, that is time enough to split brain
> > a major portion of a filesystem.  I would never trust it.
> 
> In GlusterFS that problem exists anyway, 

Why "anyway"?  It exists, sure, but it's certainly
something that I would hope gets fixed eventually.

> but it is largely
> mitigated by the fact that it works on file level rather
> than block device level. 

Certainly not FS devastating like it would be for 
a block device, but bad data is still bad data.
It would be of no consolation to me that I have
access to the rest of my FS if one really 
important file is corrupt!

> > To borrow your analogy, adding heartbeat to the
> > current AFR:  "It's a bit like fitting a big
> > padlock on the door when there's a wall missing."
> > :)  
> > Every single write needs to ensure that it will not
> > cause split brain for me to trust it.
> 
> Sounds like GlusterFS isn't necessarily the solution
> for you, then. :(

It's not all bad, it's just not usable for some 
use cases yet. 

> > If not, why would I bother with glusterfs over
> > AFR instead of glusterfs over DRBD?  Oh right, because
> > I cannot get glusterfs to failover without
> > incurring connection errors on the client! ;)
> > (not your beef, I know, from another thread)
> 
> Precisely - which is why I originally suggested not using
> GlusterFS for client-server communication. :)
...
> And this is exactly why I suggested using NFS for the
> client<->server connection. NFS blocks until the
> server becomes contactable again.

Yes, but do you have any other suggestions
besides NFS?  Anything that can be safely 
used as both a client and a server? :)

> > But, to be clear, I am not disagreeing with you
> > that the HA translator does not solve the split
> > brain problem at all.  Perhaps this is what is really
> > "upsetting" you, not that it is
> > "duplicated" functionality, but rather that
> > it does not help AFR solve its split brain personality
> > disorders, it only helps make them more available, thus
> > making split brain even more likely!! ;(
> 
> I'm not sure it makes it any worse WRT split-brain, it
> just seems that you are looking for GlusterFS+HA to provide
> you with exactly the same set of features that NFS+(server
> fail-over) already provides. 

You are right, glusterfs + AFR + HA is probably no 
worse than glusterfs + AFR + NFS.  But both make it 
slightly more likely to have split brain than simply 
glusterfs + AFR.  And glusterfs + AFR itself is much 
more likely to split brain than glusterfs + DRBD.

> Of course, there could be
> advantages in GlusterFS behaving the same way as NFS when
> the server goes away if it's a single-server setup 

I fail to see how having it behave that way would
not be desirable, even if you have many servers and 
they all went down?

> - it
> would be easier to set up and a bit more elegant. But it
> wouldn't add any functionality that couldn't be
> re-created using the sort of a setup I described.

I guess just multi path, multi protocol (encrypt one, 
not the other...).  Primarily flexibility, bonding is 
very limited.  I would think that it might in some
use cases increase bandwidth also.  My reading on 
bonding suggests that if you are using separate 
switches you can get either HA bonding or 
link aggregation bonding, but not both, right?

-Martin