[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
xhernandez at datalab.es
Mon Feb 10 10:17:12 UTC 2014
On 10/02/14 11:05, Niels de Vos wrote:
> On Tue, Feb 04, 2014 at 10:07:22AM +0100, Xavier Hernandez wrote:
>> currently, inodelk() and entrylk() are being used to make sure that
>> changes happen synchronously on all bricks, avoiding data/metadata
>> corruption when multiple clients modify the same inode concurrently.
>> So far so good, however I think this introduces a significant
>> overhead to avoid a situation that will happen very rarely. It also
>> limits the advantage of client-side caches.
>> I propose to implement a new translator that uses a MESI-like
>> protocol (protocol used to maintain memory coherency between local
>> caches of CPU cores). This translator would add virtually 0 overhead
>> when there isn't more than one client accessing the same inode, and
>> an overhead comparable to the current implementation if there is.
>> Another advantage of this protocol would be that it will be possible
>> to implement much more aggressive caching mechanisms on the client
>> side that will improve overall performance without losing any
>> current features.
>> At a high level this is how it could work:
>> Each client tracks the state of each inode it uses (M - Modified, E
>> - Exclusive, S - Shared, I - Invalid). All inodes will be created in
>> the invalid state. When the client needs to write the inode, it asks
>> all bricks for exclusive access. Once granted, the inode will be in
>> exclusive state and any read/write operation could be made locally
>> on the client side, because it knows that nobody else will be
>> modifying the inode. If the inode is successfully written (on the
>> local cache), the state will change to modified. Eventually the
>> changes will be sent to the bricks in background and the state will
>> go back to exclusive, or invalid if the inode is not needed anymore.
>> Now, if another client needs to read or write the same inode, it
>> will send a request to all bricks. If the inode is in the exclusive
>> or modified state in one of the clients, the bricks will notify the
>> current owner of the inode to flush all pending changes. Once
>> completed, the new client will be granted exclusive (if it's a write
>> request) or shared (if it's a read request) access to the inode. The
>> former owner will leave the inode in the invalid state (if it's a
>> write request) or shared (if it's a read request).
>> Multiple clients can read a shared inode simultaneously, however if
>> one client needs exclusive access to the inode, all other clients
>> will need to set the inode's state to invalid before granting exclusive
>> access.
>> The only synchronization point needed is to make sure that all
>> bricks agree on the inode state and which client owns it. This can
>> be achieved without locking using a method similar to what I
>> implemented in the DFC translator.
>> Besides the lock-less architecture, the main advantage is that much
>> more aggressive caching strategies can be implemented very close to
>> the end user, considerably increasing the throughput of the file
>> system. Special care has to be taken with things that can fail on
>> background writes (basically brick space and user access rights).
>> Those should be handled appropriately on the client side to guarantee
>> future success of writes.
>> Of course this is only a high level overview. A deeper analysis
>> should be done to see what to do on each special case.
>> What do you think ?
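The state transitions described in the proposal above can be sketched as a small model. This is only an illustration under assumptions, not GlusterFS code: the `InodeDirectory` class, its method names, and the single in-memory table standing in for "all bricks agreeing on the inode state" are hypothetical, and the network, the background flush, and error handling are left out.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class InodeDirectory:
    """Hypothetical view of one inode: which client holds it in which state."""

    def __init__(self):
        self.states = {}          # client id -> State; absent means INVALID
        self.flush_requests = []  # owners that were notified to flush, in order

    def state(self, client):
        return self.states.get(client, State.INVALID)

    def _downgrade_others(self, requester, new_state):
        # Notify any exclusive/modified owner to flush, then downgrade
        # every other holder to new_state (SHARED or INVALID).
        for client, st in list(self.states.items()):
            if client == requester:
                continue
            if st in (State.EXCLUSIVE, State.MODIFIED):
                self.flush_requests.append(client)
            if new_state is State.INVALID:
                del self.states[client]
            else:
                self.states[client] = new_state

    def acquire_read(self, client):
        if self.state(client) is not State.INVALID:
            return  # S, E or M: the inode can already be read locally
        self._downgrade_others(client, State.SHARED)
        self.states[client] = State.SHARED

    def acquire_write(self, client):
        if self.state(client) in (State.EXCLUSIVE, State.MODIFIED):
            return  # already the exclusive owner
        self._downgrade_others(client, State.INVALID)
        self.states[client] = State.EXCLUSIVE

    def write(self, client):
        # A successful write to the local cache moves E -> M.
        self.acquire_write(client)
        self.states[client] = State.MODIFIED

    def flush(self, client):
        # Changes sent to the bricks in background: M -> E.
        if self.state(client) is State.MODIFIED:
            self.states[client] = State.EXCLUSIVE


d = InodeDirectory()
d.write("A")          # A: I -> E -> M, no other client involved
d.acquire_read("B")   # A is asked to flush, then A and B are both shared
d.acquire_write("C")  # A and B are invalidated, C becomes exclusive
```

In this sketch a single client never triggers a flush request, which is the "virtually 0 overhead" case; only a second client touching the same inode causes a notification to the current owner.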
> This sounds very much like "delegations and callbacks" in NFSv4. It is
> an optional feature that servers do not need to support, and some
> clients cannot support easily (think of firewalls blocking callbacks).
> The RFC for NFSv4 documents the feature pretty well:
> - http://tools.ietf.org/html/rfc3530#section-9.2
I didn't know about it, but it really seems very similar to the idea I had.
I'll read it in more detail.
> I'd surely be interested in seeing something similar for the GlusterFS
> protocol, it definitely improved performance for certain workloads on