[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Xavier Hernandez
xhernandez at datalab.es
Tue Feb 4 09:07:22 UTC 2014
Hi,
currently, inodelk() and entrylk() are being used to make sure that
changes happen synchronously on all bricks, avoiding data/metadata
corruption when multiple clients modify the same inode concurrently. So
far so good; however, I think this introduces significant overhead to
avoid a situation that happens very rarely. It also limits the
advantage of client-side caches.
I propose to implement a new translator that uses a MESI-like protocol
(the protocol used to maintain memory coherency between the local caches
of CPU cores). This translator would add virtually zero overhead when no
more than one client is accessing the same inode, and an overhead
comparable to the current implementation if there is contention.
Another advantage of this protocol is that it would make it possible to
implement much more aggressive caching mechanisms on the client side,
improving overall performance without losing any current features.
At a high level this is how it could work:
Each client tracks the state of each inode it uses (M - Modified, E -
Exclusive, S - Shared, I - Invalid). All inodes will be created in the
invalid state. When the client needs to write the inode, it asks all
bricks for exclusive access. Once granted, the inode will be in the
exclusive state and any read/write operation can be done locally on the
client side, because it knows that nobody else will be modifying the
inode. If the inode is successfully written (in the local cache), the
state will change to modified. Eventually the changes will be sent to the
bricks in the background and the state will go back to exclusive, or
invalid if the inode is not needed anymore.
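
As a rough illustration of the client-side transitions described above,
here is a minimal Python sketch. All names (State, ClientInode,
request_exclusive, flush_to_bricks) are hypothetical and not part of any
existing GlusterFS API; the two callables stand in for the network round
trips to the bricks, and the cache is assumed to start empty.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"
    MODIFIED = "M"


class ClientInode:
    """Client-side view of one inode (hypothetical, for illustration only)."""

    def __init__(self, request_exclusive, flush_to_bricks):
        # Both callables are assumed stand-ins for requests sent to all bricks.
        self.state = State.INVALID  # every inode starts in the invalid state
        self.cache = bytearray()    # simplification: cache starts empty
        self._request_exclusive = request_exclusive
        self._flush_to_bricks = flush_to_bricks

    def write(self, offset, data):
        if self.state in (State.INVALID, State.SHARED):
            # The only step that has to contact the bricks.
            self._request_exclusive()
            self.state = State.EXCLUSIVE
        # From here on the write is purely local.
        end = offset + len(data)
        if len(self.cache) < end:
            self.cache.extend(b"\0" * (end - len(self.cache)))
        self.cache[offset:end] = data
        self.state = State.MODIFIED

    def flush(self, keep=True):
        if self.state is State.MODIFIED:
            # Background write-back: M -> E once the bricks have the data.
            self._flush_to_bricks(bytes(self.cache))
            self.state = State.EXCLUSIVE
        if not keep:
            # The inode is not needed anymore: drop it to invalid.
            self.cache.clear()
            self.state = State.INVALID
```

Note that repeated writes while the inode is already exclusive or
modified never call request_exclusive again, which is where the "virtually
zero overhead" for the uncontended case would come from.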
Now, if another client needs to read or write the same inode, it will
send a request to all bricks. If the inode is in the exclusive or
modified state in one of the clients, the bricks will notify the current
owner of the inode to flush all pending changes. Once completed, the new
client will be granted exclusive (if it's a write request) or shared (if
it's a read request) access to the inode. The former owner will leave
the inode in the invalid state (if it's a write request) or shared (if
it's a read request).
Multiple clients can read a shared inode simultaneously; however, if one
client needs exclusive access to the inode, all other clients will need
to set the inode's state to invalid before exclusive access is granted.
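
The brick-side handling of read and write requests from other clients
could be sketched as follows. This is only an illustration under assumed
names: InodeDirectory, notify and the action strings are hypothetical, and
the sketch treats the bricks as a single agreeing entity, ignoring how
they reach that agreement.

```python
class InodeDirectory:
    """Per-inode ownership record as the bricks might keep it (hypothetical)."""

    def __init__(self, notify):
        # notify(client, action) is an assumed callback that asks a client to
        # "flush_and_share", "flush_and_invalidate" or "invalidate".
        self.owner = None     # client holding the inode in E or M, if any
        self.sharers = set()  # clients holding the inode in S
        self._notify = notify

    def request_read(self, client):
        if self.owner is not None and self.owner != client:
            # Current owner flushes pending changes and downgrades to shared.
            self._notify(self.owner, "flush_and_share")
            self.sharers.add(self.owner)
            self.owner = None
        if self.owner == client:
            return "E"  # the owner can already read locally
        self.sharers.add(client)
        return "S"

    def request_write(self, client):
        if self.owner is not None and self.owner != client:
            # Current owner flushes pending changes and invalidates its copy.
            self._notify(self.owner, "flush_and_invalidate")
        for other in sorted(self.sharers - {client}):
            # Every other reader must drop to invalid before the grant.
            self._notify(other, "invalidate")
        self.sharers.clear()
        self.owner = client
        return "E"
```

In this sketch notifications are only sent when another client actually
holds the inode, so a single client working alone never triggers any.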
The only synchronization point needed is to make sure that all bricks
agree on the inode state and which client owns it. This can be achieved
without locking using a method similar to what I implemented in the DFC
translator.
Besides the lock-less architecture, the main advantage is that much more
aggressive caching strategies can be implemented very close to the end
user, considerably increasing the throughput of the file system. Special
care has to be taken with things that can fail on background writes
(basically brick space and user access rights). Those should be handled
appropriately on the client side to guarantee future success of writes.
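
One possible way to handle the brick space case, shown purely as an
assumption and not as something the proposal specifies, is for the client
to reserve space on the bricks before accepting a write into its cache,
so that the later background flush cannot fail with ENOSPC. WriteBudget
and reserve are hypothetical names.

```python
import errno


class WriteBudget:
    """Space a client may dirty before flushing (hypothetical sketch)."""

    def __init__(self, reserve):
        # reserve(nbytes) is an assumed call to the bricks that returns the
        # number of bytes actually granted (possibly fewer than requested).
        self._reserve = reserve
        self.available = 0

    def charge(self, nbytes):
        """Account for a cached write, failing up front if space is short."""
        if nbytes > self.available:
            self.available += self._reserve(nbytes - self.available)
        if nbytes > self.available:
            # Report the error at write time instead of at background flush.
            raise OSError(errno.ENOSPC, "no space reserved on bricks")
        self.available -= nbytes
```

Access rights could be treated in a similar spirit, for example by
checking them when exclusive access is granted, but that is also only an
assumption.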
Of course this is only a high level overview. A deeper analysis should
be done to see what to do on each special case.
What do you think?
Xavi