[Gluster-devel] Client side AFR race conditions?

Gordan Bobic gordan at bobich.net
Tue May 6 21:25:16 UTC 2008


Derek Price wrote:

>> I'm not saying I don't want to see a more robust solution for client 
>> side AFR, just that each configuration has its place, and client side 
>> AFR isn't currently (and may never be) capable of serving a share that 
>> requires high data integrity.

As far as I can see, there is no practical difference in this regard 
between client-side and server-side AFR. Throw multiple clients at multiple 
servers, and you have the exact same problem.

>> If you think fixing this current issue will solve your problems, maybe 
>> you haven't considered the implications of connectivity problems 
>> between some clients and some (not all) servers...  Add in some 
>> clients with slightly off timestamps and you might have some major 
>> problems WITHOUT any reboots.

Exactly what I'm thinking. But then we're back to a tightly coupled 
cluster FS design like GFS or OCFS2: implicit write locking, journalled 
metadata for replay, and quorum requirements. Just about the only thing 
that can be sanely avoided is mandatory fencing, and that only because 
there is no shared storage, so one node going nuts cannot trash the 
entire FS after the other nodes boot it out.
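
For concreteness, by quorum requirements I just mean a strict majority of 
the configured nodes; a trivial sketch (the names are invented):

# Hypothetical sketch: quorum = strict majority of the configured node set.
def have_quorum(connected_nodes, all_nodes):
    return len(connected_nodes) > len(all_nodes) // 2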

If anybody can come up with an idea of how to achieve guaranteed 
consistency across all nodes without the above, I'd love to hear about it.

> Am I getting this straight?  Even with server-side AFR, you get mirrors, 
> but if all the clients aren't talking to the same server then there is 
> no forced synchronization going on?  How hard would it be to implement 
> some sort of synchronization/locking layer over AFR such that reads and 
> writes could still go to the nearest (read: fastest) possible server yet 
> still be guaranteed to be in sync?

You'd need global implicit write locks.
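
Something like the following, in rough pseudo-Python (the lock manager and 
replica objects are invented for illustration, not an actual API):

# Sketch only: every write implicitly takes a cluster-wide write lock first,
# and the lock is not released until every reachable replica has
# acknowledged the write.
def replicated_write(lock_manager, replicas, path, offset, data):
    lock_manager.acquire_write(path)           # blocks until granted cluster-wide
    try:
        for replica in replicas:
            replica.write(path, offset, data)  # send to every reachable mirror
        for replica in replicas:
            replica.wait_ack(path)             # don't unlock before all acks
    finally:
        lock_manager.release_write(path)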

> In other words, the majority of servers would know of new version 
> numbers being written anywhere and yet reads would always serve local 
> copies (potentially after waiting for synchronization).

Unlocked files can always be read. Read-locked files can always be read. 
Write-locked files can, in theory, be neither read nor written (by anyone 
other than the lock holder) until unlocked, because while a write is in 
flight there is no guaranteed consistency between replicas. The write lock 
also cannot be released until the metadata is journalled and all the 
nodes have acknowledged the write back.

The problem with this is that you still have to verify on every read 
that there are no current write-locks in place. With strong quorum 
requirements and write-lock synchronisation, this could potentially be 
done away with, though, if you can guarantee that all connected nodes 
will always be aware of all write locks, and can acknowledge them before 
the lock is granted. This would mean, theoretically, NFS / local read 
performance, with an unavoidable write overhead. But at least this is 
not too bad, because under normal circumstances there will be orders 
of magnitude more reads than writes.
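
To make the read-side consequence concrete, a sketch under the same 
assumption (again, all names invented): if every write-lock grant is 
synchronously acknowledged by all connected nodes before it takes effect, 
a reader only has to consult its local view of the lock table, so the read 
path needs no network round trip:

def read_local(local_locks, local_store, path, offset, length):
    if local_locks.is_write_locked(path):
        local_locks.wait_until_unlocked(path)  # block until the writer commits
    return local_store.read(path, offset, length)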

> The application 
> I'm thinking of is virtualized read/write storage.  For example, say you 
> want to share some sort of data repository with offices in Europe, 
> India, and the U.S. and you only have slow links connecting the various 
> offices.  You would want all client access to happen against a local 
> mirror, and you would want to restrict traffic between the mirrors to 
> that absolutely required for locking and data synchronization.

The data transfers are already minimized if you have server-side AFR set 
up between sites (one mirror server on each site, with multiple clients 
at each site only connecting to the local server).

> The only thing I'm not quite sure of in this model is what to do if the 
> server processing a write operation crashes before the write finishes. I 
> wouldn't want reads against the other mirrors to have to wait 
> indefinitely for the crashed server to return, so the best I can come up 
> with is that "write locks" for any files that hadn't been mirrored to at 
> least one available server before a crash would need to be revoked on 
> the first subsequent attempted access of the unsynchronized file.  Then 
> when the crashed server came back up and tried to synchronize, it would 
> find that its file wasn't the current version and sync in the other 
> direction.

You heartbeat node status between the nodes. When a node drops out, the 
other nodes, after a few seconds' grace period, boot it out and release 
all its locks. Unless a node holds the lock, its journalled writes get 
discarded, so it cannot commit. Note that this means both file and 
directory metadata journalling, and the two need to be combined to handle 
deleting and re-creating the same file name. If a file gets deleted, it 
should be safe to just record in the directory metadata journal that the 
file was deleted at version X. All previous versions in the journal can be 
discarded, and the file metadata journal can also be reset to free up 
space. Until all connected nodes have acknowledged the journal commit, 
the file and directory metadata journal entries cannot be discarded.
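
In the same vein, a very rough sketch of the failure handling above (all 
names and timings are placeholders, not a design):

import time

GRACE_PERIOD = 5.0  # seconds without a heartbeat before a node gets booted

def check_heartbeats(nodes, lock_table, now=None):
    now = now if now is not None else time.time()
    for node in list(nodes):
        if now - node.last_heartbeat > GRACE_PERIOD:
            nodes.remove(node)               # boot the silent node out
            lock_table.release_all(node.id)  # drop every lock it held

def record_delete(dir_journal, file_journal, name, version):
    # A delete supersedes all earlier entries for this name, so earlier
    # directory entries can be dropped and the per-file journal reset.
    dir_journal.append(("deleted", name, version))
    dir_journal.drop_entries_before(name, version)
    file_journal.reset(name)

def maybe_discard(journal, entry, connected_nodes):
    # Journalled metadata may only be discarded once every connected node
    # has acknowledged the commit it describes.
    if all(node.has_acked(entry.id) for node in connected_nodes):
        journal.discard(entry)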

One exception could be where the number of file resync journal entries 
exceeds its storage limit, and we mark the file for full resync. The 
only tricky part then is dynamically adding and removing nodes from the 
cluster, but that can be solved in a reasonably straightforward way, by 
on-line quorum adjustments.
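
The overflow escape hatch and the membership change could look something 
like this (again, invented names and an arbitrary limit):

MAX_JOURNAL_ENTRIES = 1024  # arbitrary per-file budget for this example

def journal_or_flag_resync(file_journal, entry):
    if len(file_journal) >= MAX_JOURNAL_ENTRIES:
        file_journal.clear()             # individual entries are now useless
        file_journal.mark_full_resync()  # resolve by copying the whole file
    else:
        file_journal.append(entry)

def quorum_threshold(all_nodes):
    # On-line membership change: recompute the strict-majority threshold.
    return len(all_nodes) // 2 + 1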

But the important point is that this will likely require a lot of time, 
effort and thought to implement. :-(

Gordan




