[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Mon Feb 10 10:05:06 UTC 2014

On Tue, Feb 04, 2014 at 10:07:22AM +0100, Xavier Hernandez wrote:
> Hi,
> 
> currently, inodelk() and entrylk() are being used to make sure that
> changes happen synchronously on all bricks, avoiding data/metadata
> corruption when multiple clients modify the same inode concurrently.
> So far so good, however I think this introduces a significant
> overhead to avoid a situation that will happen very rarely. It also
> limits the advantage of client-side caches.
> 
> I propose to implement a new translator that uses a MESI-like
> protocol (protocol used to maintain memory coherency between local
> caches of CPU cores). This translator would add virtually 0 overhead
> when there isn't more than one client accessing the same inode, and
> an overhead comparable to the current implementation if there is
> contention.
> 
> Another advantage of this protocol would be that it will be possible
> to implement much more aggressive caching mechanisms on the client
> side that will improve overall performance without losing any
> current features.
> 
> At a high level this is how it could work:
> 
> Each client tracks the state of each inode it uses (M - Modified, E
> - Exclusive, S - Shared, I - Invalid). All inodes will be created in
> the invalid state. When the client needs to write the inode, it asks
> all bricks for exclusive access. Once granted, the inode will be in
> exclusive state and any read/write operation could be made locally
> on the client side, because it knows that nobody else will be
> modifying the inode. If the inode is successfully written (on the
> local cache), the state will change to modified. Eventually the
> changes will be sent to the bricks in background and the state will
> go back to exclusive, or invalid if the inode is not needed anymore.
> 
> Now, if another client needs to read or write the same inode, it
> will send a request to all bricks. If the inode is in the exclusive
> or modified state in one of the clients, the bricks will notify the
> current owner of the inode to flush all pending changes. Once
> completed, the new client will be granted exclusive (if it's a write
> request) or shared (if it's a read request) access to the inode. The
> former owner will leave the inode in the invalid state (if it's a
> write request) or shared (if it's a read request).
> 
> Multiple clients can read a shared inode simultaneously, however if
> one client needs exclusive access to the inode, all other clients
> will need to set the inode's state to invalid before exclusive
> access is granted.
> 
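
The transitions described above can be sketched as a small state model.
This is only a toy in Python, not GlusterFS code: the names
(InodeCoherency, acquire_read, acquire_write, local_write, flush) are
hypothetical, one object stands in for the view that all bricks are
assumed to agree on, and a flush is merely recorded instead of being
sent anywhere.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class InodeCoherency:
    """Toy model of the agreed per-inode state (all names hypothetical)."""

    def __init__(self):
        self.states = {}   # client id -> State; a missing entry means INVALID
        self.flushes = []  # clients that were asked to flush, in order

    def state(self, client):
        return self.states.get(client, State.INVALID)

    def _recall(self, requester, downgrade_to):
        # Notify every other holder; a MODIFIED holder must flush first.
        for client, st in list(self.states.items()):
            if client == requester or st is State.INVALID:
                continue
            if st is State.MODIFIED:
                self.flushes.append(client)
            self.states[client] = downgrade_to

    def acquire_read(self, client):
        # Any non-invalid state already allows local reads.
        if self.state(client) is not State.INVALID:
            return
        self._recall(client, State.SHARED)
        self.states[client] = State.SHARED

    def acquire_write(self, client):
        if self.state(client) in (State.EXCLUSIVE, State.MODIFIED):
            return
        self._recall(client, State.INVALID)
        self.states[client] = State.EXCLUSIVE

    def local_write(self, client):
        # Writes to the local cache are only legal with exclusive access.
        if self.state(client) not in (State.EXCLUSIVE, State.MODIFIED):
            raise PermissionError("exclusive access required before writing")
        self.states[client] = State.MODIFIED

    def flush(self, client):
        # Background write-back: MODIFIED goes back to EXCLUSIVE.
        if self.state(client) is State.MODIFIED:
            self.flushes.append(client)
            self.states[client] = State.EXCLUSIVE
```

For example, if client "A" calls acquire_write() and local_write(), and
client "B" then calls acquire_read(), "A" is asked to flush and both end
up in the shared state. A later acquire_write() from a third client
invalidates both. Error handling for failed background writes and the
agreement between bricks are deliberately left out.
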
> The only synchronization point needed is to make sure that all
> bricks agree on the inode state and which client owns it. This can
> be achieved without locking using a method similar to what I
> implemented in the DFC translator.
> 
> Besides the lock-less architecture, the main advantage is that much
> more aggressive caching strategies can be implemented very near to
> the final user, increasing considerably the throughput of the file
> system. Special care has to be taken with things that can fail on
> background writes (basically brick space and user access rights).
> Those should be handled appropriately on the client side to guarantee
> future success of writes.
> 
> Of course this is only a high level overview. A deeper analysis
> should be done to see what to do on each special case.
> 
> What do you think?

This sounds very much like "delegations and callbacks" in NFSv4. It is
an optional feature that servers do not need to support, and that some
clients cannot easily support (think of firewalls blocking callbacks).
The RFC for NFSv4 documents the feature pretty well:
- http://tools.ietf.org/html/rfc3530#section-9.2

I'd surely be interested in seeing something similar for the GlusterFS
protocol; delegations definitely improved performance for certain
workloads on NFS.

Niels