[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Tue Feb 4 09:07:22 UTC 2014

Hi,

Currently, inodelk() and entrylk() are used to make sure that 
changes happen synchronously on all bricks, avoiding data/metadata 
corruption when multiple clients modify the same inode concurrently. So 
far so good. However, I think this introduces significant overhead to 
avoid a situation that happens very rarely. It also limits the 
advantage of client-side caches.

I propose to implement a new translator that uses a MESI-like protocol 
(the protocol used to maintain coherency between the local caches of 
CPU cores). This translator would add virtually 0 overhead when no more 
than one client is accessing the same inode, and an overhead comparable 
to the current implementation if there is contention.

Another advantage of this protocol would be that it will be possible to 
implement much more aggressive caching mechanisms on the client side 
that will improve overall performance without losing any current features.

At a high level this is how it could work:

Each client tracks the state of each inode it uses (M - Modified, E - 
Exclusive, S - Shared, I - Invalid). All inodes will be created in the 
invalid state. When the client needs to write the inode, it asks all 
bricks for exclusive access. Once granted, the inode will be in the 
exclusive state and any read/write operation can be performed locally 
on the client side, because it knows that nobody else will be modifying 
the inode. If the inode is successfully written (in the local cache), 
the state will change to modified. Eventually the changes will be sent 
to the bricks in the background and the state will go back to 
exclusive, or to invalid if the inode is not needed anymore.
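
As a rough illustration of the client-side transitions described above, 
here is a minimal Python sketch. All names (State, CachedInode, 
acquire_exclusive, send_to_bricks) are made up for this example and are 
not existing GlusterFS APIs. The two callbacks stand in for the network 
round trips to the bricks.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"
    MODIFIED = "M"


class CachedInode:
    """Client-side view of one inode (hypothetical names throughout)."""

    def __init__(self):
        self.state = State.INVALID  # every inode starts out invalid
        self.dirty = []             # writes not yet sent to the bricks

    def write(self, data, acquire_exclusive):
        # acquire_exclusive stands in for asking all bricks for exclusive
        # access; it is only needed when we do not already own the inode.
        if self.state in (State.INVALID, State.SHARED):
            acquire_exclusive()
            self.state = State.EXCLUSIVE
        self.dirty.append(data)     # the write lands in the local cache only
        self.state = State.MODIFIED

    def flush(self, send_to_bricks):
        # Background write-back: M -> E once the bricks have the data.
        if self.state is State.MODIFIED:
            send_to_bricks(self.dirty)
            self.dirty = []
            self.state = State.EXCLUSIVE

    def release(self, send_to_bricks):
        # Inode no longer needed: flush pending data, then drop to invalid.
        self.flush(send_to_bricks)
        self.state = State.INVALID
```

Note that in this sketch a second write() on an inode that is already 
exclusive or modified does not contact the bricks at all, which is where 
the "virtually 0 overhead" for the uncontended case would come from.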

Now, if another client needs to read or write the same inode, it will 
send a request to all bricks. If the inode is in the exclusive or 
modified state in one of the clients, the bricks will notify the current 
owner of the inode to flush all pending changes. Once completed, the new 
client will be granted exclusive (if it's a write request) or shared (if 
it's a read request) access to the inode. The former owner will leave 
the inode in the invalid state (if it's a write request) or shared (if 
it's a read request).

Multiple clients can read a shared inode simultaneously. However, if 
one client needs exclusive access to the inode, all other clients must 
set the inode's state to invalid before exclusive access is granted.
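
The contention handling described in the last two paragraphs could be 
sketched from the brick's point of view as follows. This is only an 
illustration under my own assumptions: InodeArbiter, request() and the 
recall() callback are hypothetical names, a single brick is modelled, 
and the agreement between bricks is left out entirely.

```python
class InodeArbiter:
    """Brick-side bookkeeping for one inode (hypothetical sketch)."""

    def __init__(self):
        self.owner = None     # client holding the inode in E or M, if any
        self.readers = set()  # clients holding the inode in S

    def request(self, client, want_write, recall):
        # recall(holder, new_state) asks a holder to flush any pending
        # changes and downgrade its copy to new_state ("I" or "S").
        if self.owner is not None and self.owner != client:
            recall(self.owner, "I" if want_write else "S")
            if not want_write:
                self.readers.add(self.owner)  # former owner keeps a shared copy
            self.owner = None
        if want_write:
            # Every other shared copy must be invalidated first.
            for reader in sorted(self.readers - {client}):
                recall(reader, "I")
            self.readers.clear()
            self.owner = client
            return "E"
        if self.owner == client:
            return "E"  # already the exclusive owner, nothing to do
        self.readers.add(client)
        return "S"
```

For example, if c1 writes, then c2 reads, then c3 writes, this sketch 
recalls c1 to shared on the read, and recalls both c1 and c2 to invalid 
before c3 is granted exclusive access.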

The only synchronization point needed is to make sure that all bricks 
agree on the inode state and which client owns it. This can be achieved 
without locking using a method similar to what I implemented in the DFC 
translator.

Besides the lock-less architecture, the main advantage is that much more 
aggressive caching strategies can be implemented very close to the end 
user, considerably increasing the throughput of the file system. Special 
care has to be taken with things that can fail on background writes 
(basically brick space and user access rights). Those should be handled 
appropriately on the client side to guarantee that future writes succeed.

Of course, this is only a high-level overview. A deeper analysis should 
be done to see what to do in each special case.

What do you think?

Xavi