[Gluster-devel] HA translator questions

Thu Jan 1 19:17:41 UTC 2009

--- On Thu, 1/1/09, Krishna Srinivas <krishna at zresearch.com> wrote:
> <mogulguy at yahoo.com> wrote:
> > --- On Thu, 1/1/09, Krishna Srinivas <krishna at zresearch.com> wrote:
> >>
> >> <mogulguy at yahoo.com> wrote:
> > Hmm, I don't see this looping on failure in the code, but my
> > understanding of the translator design is fairly minimal.  I will
> > have to look harder.  I was hoping to modify the subvolume looping
> > so that it loops back upon itself indefinitely if all the
> > subvolumes fail.  If this could be done, it seems like an easy way
> > to achieve NFS-style blocking when the server is down (see my other
> > thread on this), by simply using the HA translator with only one
> > subvolume.
> 
> Just curious, why do you want the application to hang till
> the server comes back up? The indefinite hang is not desirable to most
> users.

Because very few applications are written to recover from intermittent errors.  Once they see an error, they give up.  Picture a bunch of clients relying on the FS on the server: if the server crashes, they will likely all be hosed.  And since the client machines themselves did not crash, they will likely never recover until someone reboots them.  Simply hanging and recovering when the server comes back up is an essential feature for most networked filesystem clients.

> In case of NFS if the NFS server is down, won't the client
> error out saying that server is down?

No, with the usual hard mount it will hang indefinitely until the server comes up.  The clients will therefore not fail; they simply carry on with business as usual when the server returns, with only a delay: no errors, no application restarts/reboots required.
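
As a rough illustration of that blocking behavior, here is a minimal Python sketch. It is not GlusterFS or NFS code: `ServerDown`, `call_blocking`, and `make_flaky_op` are hypothetical names made up for the example, and the "server" is simulated by a function that fails a fixed number of times before recovering.

```python
import time


class ServerDown(Exception):
    """Hypothetical error raised when the server is unreachable."""


def call_blocking(op, retry_delay=0.0, sleep=time.sleep):
    """Retry op() until it succeeds, so the caller never sees ServerDown.

    Returns (result, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return op(), attempts
        except ServerDown:
            # Server is down: wait, then loop back and try again.
            sleep(retry_delay)


def make_flaky_op(failures_before_recovery):
    """Simulate a server that is down for the first N calls (assumption)."""
    state = {"remaining": failures_before_recovery}

    def op():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ServerDown("server unreachable")
        return "data"

    return op


result, attempts = call_blocking(make_flaky_op(3))
print(result, attempts)  # data 4
```

The application calling `call_blocking` only ever observes a delay; the three failed attempts are absorbed by the retry loop rather than surfaced as errors.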

> > Also, how about failure due to replies that do not
> > return because the link is down?  Are the requests saved
> > after they are sent until the reply arrives so that it can
> > be resent on the other link if the original link
> > successfully sends the request, but goes down afterwards and
> > cannot receive the reply?
> 
> Yes, requests are saved so that they can be retried on another
> subvol if the current subvol goes down during the operation.

Cool, this brings up one last extreme corner case that concerns me.  What if client A sends a write request for file foo through HA to subvolume 1, and the link goes down after subvolume 1 services the request but before it can reply that the write has completed?  You have confirmed that in this case client A will retry on subvolume 2.  If subvolumes 1 & 2 share the same backend, the write to file foo will already have taken place at this point.  That makes it possible for client B to read file foo and write something new to it before the HA translator resends client A's write request on subvolume 2.  When the resend from client A finally reaches subvolume 2, it could reapply client A's original write to file foo, overwriting client B's write, which depended on client A's first write.

Is the scenario above possible?  Or would both subvolumes 1 & 2 somehow know not to process client B's write request until they know that client A has received an ACK for its original write request and therefore is not going to resend it?  I know that this is a somewhat far-fetched corner case, but if it is possible, I believe that unfortunately this would be non-POSIX-compliant behavior.  This is the same concern I had with case #3 in my proposed fixes on my NFS blocking thread.  Make sense at all?
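
To spell out the interleaving in question, here is a toy Python simulation. Everything in it is an assumption for illustration: `SharedBackend`, `Subvolume`, `LinkDown`, and `run_scenario` are hypothetical names, the two subvolumes are plain objects writing through to one shared store, and the replay is done naively with no duplicate detection. It says nothing about what the real HA translator does.

```python
class LinkDown(Exception):
    """Hypothetical error: the reply was lost because the link went down."""


class SharedBackend:
    """One store that both subvolumes write through to (assumption)."""

    def __init__(self):
        self.files = {}
        self.log = []  # order in which writes were applied

    def write(self, client, path, data):
        self.files[path] = data
        self.log.append((client, data))

    def read(self, path):
        return self.files.get(path)


class Subvolume:
    def __init__(self, backend, drop_reply=False):
        self.backend = backend
        self.drop_reply = drop_reply

    def write(self, client, path, data):
        # The write is applied on the server side first...
        self.backend.write(client, path, data)
        # ...and only then may the reply be lost on the way back.
        if self.drop_reply:
            raise LinkDown("reply lost")
        return "ACK"


def run_scenario():
    backend = SharedBackend()
    sub1 = Subvolume(backend, drop_reply=True)
    sub2 = Subvolume(backend)

    # Client A writes via subvolume 1; the write lands but the ACK is lost.
    saved_request = ("A", "foo", "A1")
    try:
        sub1.write(*saved_request)
        acked = True
    except LinkDown:
        acked = False

    # Before the retry, client B reads A's data and writes a dependent value.
    seen = backend.read("foo")
    sub2.write("B", "foo", seen + "+B")

    # The saved request is now naively replayed on subvolume 2.
    if not acked:
        sub2.write(*saved_request)

    return backend.read("foo"), backend.log


final, log = run_scenario()
print(final)  # A1
```

In this toy model the final content of foo is client A's original data, and client B's dependent write has been silently lost, which is exactly the outcome being asked about.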

I wonder how NFS deals with a similar potential problem?  It seems like this (case #3, not the HA case) might be possible with NFS also, unless it keeps track of all writes for which it knows the client hasn't yet received an ACK, and does not allow other writes to the same place until then?
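
One generic way to make a resent write harmless is for the server side to remember which requests it has already applied and to answer a replay from that record instead of applying it again. The Python sketch below illustrates the idea only; it is not a claim about how GlusterFS or NFS actually behaves. `DedupBackend` and the per-client `request_id` are hypothetical, and the record of completed requests is assumed to live in the shared backend so that every subvolume sees it.

```python
class DedupBackend:
    """Shared store that ignores replays of already-applied requests.

    Hypothetical sketch: each request carries a (client, request_id) pair
    that stays the same when the request is resent.
    """

    def __init__(self):
        self.files = {}
        self.completed = {}  # (client, request_id) -> cached reply

    def write(self, client, request_id, path, data):
        key = (client, request_id)
        if key in self.completed:
            # Replay of a request that was already applied: return the
            # cached reply and leave the file untouched.
            return self.completed[key]
        self.files[path] = data
        self.completed[key] = "ACK"
        return "ACK"


backend = DedupBackend()
backend.write("A", 1, "foo", "A1")    # applied; assume the ACK is lost
backend.write("B", 1, "foo", "A1+B")  # client B's dependent write
backend.write("A", 1, "foo", "A1")    # client A's resend, recognized as a replay
print(backend.files["foo"])  # A1+B
```

With this bookkeeping, client A's resend still gets its ACK, but client B's later write survives. The cost is that the record has to be shared by every path to the backend and pruned at some point, which this sketch does not attempt.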

Thanks again,

-Martin