[Gluster-devel] Client side AFR race conditions?
Martin Fick
mogulguy at yahoo.com
Fri May 2 18:17:24 UTC 2008
--- Krishna Srinivas <krishna at zresearch.com> wrote:
> > I am curious, is client side AFR susceptible
> > to race conditions on writes? If not, how is this
> > mitigated?
> This is a known issue with the client side AFR.
Ah, OK. Perhaps it is already documented somewhere,
but I can't help thinking that the AFR translator
deserves a page dedicated to the design trade-offs
that were made and the impact they have. With enough
thought it is possible to deduce or guess at some of
the potential problems, such as split-brain and race
conditions, but for most of us it remains a guess
until we ask on the list. Perhaps with the help of
others I will set up a wiki page for this. That kind
of documented info would probably help avoid
situations like the one with Garreth, where he felt
misled by the glusterfs documentation.
> We can solve this by locking but there will be
> performance hit. Of course if applications lock
> themselves then all will be fine. I feel we can have
> it as an option to disable the locking
> in case users are more concerned about performance.
>
> Do you have any suggestions?
I haven't given it a lot of thought, but how would
the locking work? Would you be doing:
SubA           AFR         application          SubB
 |              |               |                 |
 |              |<----write-----|                 |
 |              |               |                 |
 |<-----lock----|--------------lock-------------->|
 |----locked--->|<-------------locked-------------|
 |              |               |                 |
 |<----write----|--------------write------------->|
 |---written--->|<------------written-------------|
 |              |               |                 |
 |<---unlock----|-------------unlock------------->|
 |--unlocked--->|<-----------unlocked-------------|
 |              |               |                 |
 |              |----written--->|                 |
because that does seem to be a rather large
three-roundtrip latency versus the current single
roundtrip, not including all the lock-contention
performance hits! This solution also has the problem
of lock recovery if a client dies.
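For what it's worth, here is roughly what
"applications lock themselves" would look like from
the application side. This is just a minimal sketch
using POSIX advisory locks (fcntl) over a
hypothetical mount path, not anything AFR-specific,
and it assumes the fuse client passes the locks
through end to end:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Take an exclusive advisory lock on the byte range, write, unlock.
 * Every cooperating client must do the same for this to serialize
 * the writes. */
int locked_write(int fd, const char *buf, size_t len, off_t off)
{
    struct flock lk = {
        .l_type   = F_WRLCK,    /* exclusive write lock         */
        .l_whence = SEEK_SET,
        .l_start  = off,        /* lock only the range we touch */
        .l_len    = (off_t)len,
    };

    if (fcntl(fd, F_SETLKW, &lk) == -1)   /* block until granted */
        return -1;

    ssize_t n = pwrite(fd, buf, len, off);

    lk.l_type = F_UNLCK;                  /* release the range */
    fcntl(fd, F_SETLK, &lk);

    return (n == (ssize_t)len) ? 0 : -1;
}

int main(void)
{
    /* hypothetical glusterfs mount point */
    int fd = open("/mnt/glusterfs/shared.dat", O_RDWR | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return 1; }

    const char *msg = "hello";
    if (locked_write(fd, msg, strlen(msg), 0) == -1)
        perror("locked_write");

    close(fd);
    return 0;
}

If AFR did the same internally on the user's behalf,
you would get the three roundtrips in the diagram
above.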
If instead a rank (which could be configurable or
random) were given to each subvolume on startup, one
alternative would be to always write to the
highest-ranking subvolume first:
(A is a higher rank than B)
SubA           AFR         Application          SubB
 |              |               |                 |
 |              |<----write-----|                 |
 |<---write-----|               |                 |
 |---version--->|               |                 |
 |              |----written--->|                 |
 |              |               |                 |
 |              |-----------(quick)heal---------->|
 |              |<-------------healed-------------|
The quick heal would essentially be the write itself,
but knowing/enforcing the version number returned
from the SubA write. Since all clients would always
have to write to SubA first, SubA's ordering would be
reflected on every subvolume. While this solution
leaves a potentially larger window during which SubB
is unsynced, it should maintain the single-roundtrip
latency from an application's standpoint and avoid
any lock-contention performance hits? If a client
dies in this scenario, any other client could always
heal SubB from SubA, so there are no lock-recovery
problems.
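To make the idea concrete, here is a toy sketch of
the version-enforced ordering. subvol_write() and
subvol_heal() are invented stand-ins for whatever the
AFR fops would actually be, not real glusterfs code:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *name;
    uint64_t    version;   /* last version applied on this subvolume */
} subvol_t;

/* Write on the highest-ranking subvolume: it assigns the next
 * version number, which defines the canonical ordering for everyone. */
static uint64_t subvol_write(subvol_t *sv, const char *data)
{
    sv->version++;
    printf("%s: applied \"%s\" as v%llu\n",
           sv->name, data, (unsigned long long)sv->version);
    return sv->version;
}

/* Quick heal on a lower-ranking subvolume: replay the write, but
 * only in the exact version order that SubA established. */
static int subvol_heal(subvol_t *sv, const char *data, uint64_t version)
{
    if (version != sv->version + 1)
        return -1;           /* out of order: defer and re-heal later */
    sv->version = version;
    printf("%s: healed \"%s\" to v%llu\n",
           sv->name, data, (unsigned long long)version);
    return 0;
}

int main(void)
{
    subvol_t suba = { "SubA", 0 }, subb = { "SubB", 0 };

    /* The application's write returns once SubA acknowledges... */
    uint64_t v = subvol_write(&suba, "block-1");

    /* ...and the heal carries SubA's version number so that SubB
     * applies writes in the same order, without any locks. */
    if (subvol_heal(&subb, "block-1", v) != 0)
        fprintf(stderr, "SubB is behind; retry heals in version order\n");

    return 0;
}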
Both of these solutions could probably be greatly
enhanced with a write-ahead-log translator or some
form of buffering above each subvolume; this would
decrease the latency by allowing the write data to be
transferred before/while the lock/ordering info is
synchronized. But this may be rather complicated?
As is, however, they both seem like fairly simple
solutions without too much of a design change?
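If it helps, the buffering idea might look
conceptually like this two-phase sketch; the
staging/commit split and the txn ids are purely
illustrative assumptions on my part:

#include <stdint.h>
#include <stdio.h>

#define STAGE_MAX 4

typedef struct {
    uint64_t txn;            /* invented transaction id */
    char     data[32];
    int      used;
} staged_t;

static staged_t stage[STAGE_MAX];

/* Phase 1: ship the bytes to the subvolume as soon as possible,
 * before the lock/version handshake has decided the ordering. */
static void stage_write(uint64_t txn, const char *data)
{
    for (int i = 0; i < STAGE_MAX; i++) {
        if (!stage[i].used) {
            stage[i].txn  = txn;
            snprintf(stage[i].data, sizeof stage[i].data, "%s", data);
            stage[i].used = 1;
            return;
        }
    }
}

/* Phase 2: once the ordering is known, commit the staged payloads
 * in that order; the data transfer has already happened. */
static void commit_write(uint64_t txn)
{
    for (int i = 0; i < STAGE_MAX; i++) {
        if (stage[i].used && stage[i].txn == txn) {
            printf("committed txn %llu: %s\n",
                   (unsigned long long)txn, stage[i].data);
            stage[i].used = 0;
            return;
        }
    }
}

int main(void)
{
    stage_write(1, "payload-A");   /* in flight during the handshake */
    stage_write(2, "payload-B");
    commit_write(1);               /* ordering decided: 1 then 2 */
    commit_write(2);
    return 0;
}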
The non-locking approach seems a little odd at first
and may be more of a conceptual change to the current
AFR method, but the more I think about it, the more
appealing it seems. Perhaps it would not actually
even be a big coding change? I can't help thinking
that this method could also potentially be useful for
eliminating more split-brain situations, but I
haven't worked that out yet.
There is a somewhat subtle reason why the locking
solution is slower: locking enforces serialization
across all the writes. That serialization is not
really what is needed; we only need to ensure that
the (potentially unserialized) ordering is the same
on both subvolumes.
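A toy illustration of that distinction: the two
clients below never block each other, yet both
replicas end up with the identical order, because
SubB replays SubA's arrival order rather than racing
it (all the names here are illustrative):

#include <stdio.h>

#define LOG_MAX   8
#define ENTRY_LEN 16

/* A subvolume is modeled as just an ordered log of applied writes. */
typedef struct {
    char entries[LOG_MAX][ENTRY_LEN];
    int  n;
} log_t;

static void log_append(log_t *l, const char *w)
{
    snprintf(l->entries[l->n++], ENTRY_LEN, "%s", w);
}

int main(void)
{
    log_t suba = {0}, subb = {0};

    /* Two clients issue concurrent writes; neither blocks on the
     * other, so their relative order is decided only by arrival
     * at the highest-ranking subvolume. */
    log_append(&suba, "c2:w1");    /* c2 happened to arrive first */
    log_append(&suba, "c1:w1");

    /* SubB replays SubA's log verbatim: identical order, no locks. */
    for (int i = 0; i < suba.n; i++)
        log_append(&subb, suba.entries[i]);

    for (int i = 0; i < suba.n; i++)
        printf("slot %d: SubA=%s  SubB=%s\n",
               i, suba.entries[i], subb.entries[i]);
    return 0;
}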
Thoughts?
-Martin
P.S. Simple ascii diagrams generated with:
http://www.theficks.name/test/Content/pmwiki.php?n=Sdml.HomePage