[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Xavier Hernandez xhernandez at datalab.es
Mon Feb 10 08:46:47 UTC 2014

These are a few ideas I had about how to implement a MESI-like protocol 
on gluster. It's more a bunch of ideas than a structured proposal; 
however, I hope it's clear enough to show the basic concepts as I see them.

Each inode will have two separate access levels: one for the metadata 
and one for the data. They could be different. Additionally, many 
security checks, POSIX compliance checks and some other tasks will need 
to be performed on the client side, since many requests could be 
satisfied directly without accessing the bricks.

First the easy part: under normal circumstances without any node 
failures or other errors.

The main idea is that each client, before processing a request, will 
check if it has enough information from the related inode to process the 
request locally (i.e. at least shared access to the inode for read 
requests or exclusive access for writes). If it has enough information, 
the request will be immediately processed and returned to the upper 
translators and, for write requests, the operation will be continued in 
the background.
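
A minimal sketch of that local check, assuming three hypothetical 
levels (invalid, shared, exclusive) kept separately for metadata and 
data. The names Access, InodeCache and can_serve_locally are only 
illustrative; nothing like them exists in gluster today.

```python
from enum import IntEnum


class Access(IntEnum):
    # Hypothetical levels, ordered so that a higher value implies more rights.
    INVALID = 0    # nothing held: every request must go to the bricks
    SHARED = 1     # enough to answer reads locally
    EXCLUSIVE = 2  # enough to answer reads and writes locally


class InodeCache:
    """Client-side state for one inode (metadata and data tracked apart)."""

    def __init__(self):
        self.metadata = Access.INVALID
        self.data = Access.INVALID


def required_access(is_write):
    # Default rule from the text: shared for reads, exclusive for writes.
    return Access.EXCLUSIVE if is_write else Access.SHARED


def can_serve_locally(held, is_write):
    # True when the level already held is enough to skip the bricks.
    return held >= required_access(is_write)
```

With this, a client holding only shared access on the data can answer a 
read locally, while a write still has to be sent to the bricks.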

If the client doesn't have enough access to the inode, it can attach 
information to the request to tell the bricks which kind of access it 
wants for the inode. By default an operation needs a specific access 
level (i.e. shared access for reads and exclusive access for writes). 
However, the client can request a less strict level if it won't need 
the full access in the near future. For example, a write request needs 
exclusive access, and the bricks will execute it with exclusive access, 
but the client can ask for only shared access if it foresees that the 
following operations will only be reads.

Additionally, for exclusive requests, a required space estimate must 
also be attached to the request. The bricks will use this value to 
reserve the requested amount of space for this client. This is needed 
to control available space on writes and allow them to be executed 
locally on the client side. When the available space on a brick gets 
too low, the brick can deny any exclusive access to keep better control 
of the available space.

A request can specify access levels for more than one inode (this is 
useful for operations like rename that involve more than one inode). 
This information will be sent as new entries inside the xdata argument. 
Then the request will be sent to the bricks and the client will wait 
until it receives the answer. Bricks can answer in three ways:

1. The operation cannot be processed because it is impossible to get the 
desired access to the inode(s). This shouldn't happen, but it must be 
taken into account.
2. The operation has been processed successfully (even if the result of 
the operation is an error) but the desired level of access has not been 
granted.
3. The operation has been processed successfully and the desired level 
of access has been granted.
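
To make the request side more concrete, here is a rough sketch of how 
the access hints could be packed into xdata and how the three answers 
could be named. The key names ("access.<inode>", "reserve.<inode>"), 
the Reply enum and build_xdata are assumptions made up for this 
example, not an existing gluster interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reply(Enum):
    # The three possible answers from the bricks, in the order listed above.
    ACCESS_UNOBTAINABLE = 1  # operation not processed
    DONE_NOT_GRANTED = 2     # operation processed, access not granted
    DONE_GRANTED = 3         # operation processed, access granted


@dataclass
class AccessRequest:
    inode: str                           # hypothetical inode identifier
    level: str                           # "shared" or "exclusive"
    reserve_bytes: Optional[int] = None  # space estimate (exclusive only)


def build_xdata(requests):
    """Pack one entry per inode; a rename would pass several requests."""
    xdata = {}
    for req in requests:
        if req.level not in ("shared", "exclusive"):
            raise ValueError("unknown access level: " + req.level)
        xdata["access." + req.inode] = req.level
        if req.level == "exclusive":
            # Exclusive requests must carry a space estimate.
            if req.reserve_bytes is None:
                raise ValueError("exclusive access needs a space estimate")
            xdata["reserve." + req.inode] = req.reserve_bytes
    return xdata
```

A rename could, for instance, ask for exclusive access to the parent 
directory with some reserved space and only shared access to the file.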

When the operation succeeds and the request involved more than one 
inode, it might happen that the bricks grant access to one of them but 
not to the others. It's also possible that one brick grants access to 
one inode but another brick does not (for example if a brick is in a 
very low space condition). In these cases the client will consider that 
access has been denied.
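
The all-or-nothing rule above can be sketched as follows, assuming each 
brick reports the set of inodes it granted. The function name and the 
shape of the replies are hypothetical.

```python
def access_granted(replies, inodes):
    """Return True only if every brick granted every requested inode.

    replies: dict mapping a brick name to the set of inodes it granted.
    inodes:  the inodes the client asked access for.
    """
    if not replies:
        # No answers at all means nothing was granted.
        return False
    return all(
        inode in granted
        for granted in replies.values()
        for inode in inodes
    )
```

Treating a partial grant as a denial keeps the client logic simple: it 
never has to reason about holding access on some bricks but not others.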

When the access has been denied but the request has succeeded, it means 
that any future request involving the same inode will need to be sent to 
the bricks with the extra access information again.

This also gives the bricks enough control to not grant exclusive 
access to an inode if they detect that multiple clients are accessing 
it concurrently.
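
One possible brick-side policy combining the two reasons for denial 
mentioned so far (concurrent clients and low free space). This is only 
a sketch: the parameter names and the low_space_mark threshold are 
assumptions, and a real policy would probably be more elaborate.

```python
def grant_exclusive(other_active_clients, free_bytes, reserve_bytes,
                    low_space_mark):
    """Decide whether a brick grants exclusive access to an inode.

    other_active_clients: clients currently accessing the same inode.
    free_bytes:           space still unreserved on the brick.
    reserve_bytes:        space estimate attached to the request.
    low_space_mark:       hypothetical threshold below which the brick
                          stops handing out exclusive access.
    """
    if other_active_clients > 0:
        # Several clients use the inode: keep the operations on the brick.
        return False
    if free_bytes - reserve_bytes < low_space_mark:
        # Granting would leave too little space under the brick's control.
        return False
    return True
```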

All requests containing inode access information will need to be 
strictly ordered to guarantee that all bricks process the requests in 
the same order. Requests executed in background because the client 
already had exclusive access can be executed in any order (the exclusive 
access is enough to avoid corruptions).

Specific details about some fops:

* open(), opendir(). The open flags can be used to determine the desired 
access. An O_RDONLY open will request 'shared' access. An O_RDWR or 
O_WRONLY open will request 'exclusive' access. An O_WRONLY open could 
also disable caching because the cache will never be used.
* When the last fd of an inode is released, the current ownership can be 
released (i.e. set the cache entry to 'invalid').
* Synchronization fops, like flush(), fsync() and fsyncdir(), will 
always be sent synchronously even if the client has exclusive access to 
the inode.
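
The mapping from open flags to the desired access could look like this. 
The os.O_* constants are the real ones; access_for_open and its return 
value (level, whether to cache data) are just an illustration.

```python
import os

# Mask selecting the access-mode bits of the open flags.
_ACCMODE = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def access_for_open(flags):
    """Return (access level, cache data?) for a set of open flags."""
    mode = flags & _ACCMODE
    if mode == os.O_RDONLY:
        return ("shared", True)
    if mode == os.O_WRONLY:
        # Data is never read back through this fd, so a cache is useless.
        return ("exclusive", False)
    # O_RDWR: writes need exclusive access and reads can use the cache.
    return ("exclusive", True)
```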

The not so easy part: If something fails.

The big problem is what to do when a client dies having exclusive access 
to some inodes or loses connection or a brick has any problem. There are 
a lot of cases and I haven't analyzed all of them deeply. This is only a 
first approach.

When a brick dies:

In this case all clients will cease to receive answers from it. This 
would need to be handled as it's currently done, depending on the volume 
type (for replicate, the other bricks will keep the volume working; 
for disperse, a part of the volume could be lost). When the brick comes 
online again and reconnects, the current access levels owned by each 
client will need to be requested again (this is similar to the current 
procedure to reopen fd's). If any of the requests to restore ownership 
fails, the client will consider that it has lost the access to the inode 
and it will need to ask for it again in future requests.
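
The recovery step after a brick reconnects could be sketched like this. 
The request_access callback stands for whatever call would re-request 
ownership from the brick; it is a placeholder, not an existing API.

```python
def restore_after_reconnect(held, request_access):
    """Re-request every access level the client owned before the failure.

    held:           dict mapping inode -> level owned before the failure.
    request_access: callable(inode, level) -> bool, True if the brick
                    grants the level again (hypothetical callback).
    Returns the dict of accesses the client still owns afterwards.
    """
    kept = {}
    for inode, level in held.items():
        if request_access(inode, level):
            kept[inode] = level
        # Otherwise the access is considered lost and the client will
        # have to ask for it again in a future request.
    return kept
```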

When a client dies:

If it doesn't have ownership of any inode, nothing special happens. 
Otherwise, if it has 'exclusive' access to one or more inodes, all 
bricks will try to notify this client when another client requests 
'shared' or 'exclusive' access. This notification will have a timeout. 
If the client doesn't answer in the specified time, it will lose the 
ownership and all requests coming from that client without access 
information attached to the xdata will be denied. This can lead to some 
data loss. However, since the caching will be write-through and flush(), 
fsync() and fsyncdir() would have been executed synchronously, the 
likelihood of data loss is small and the semantics of POSIX allow it 
(I'm not a POSIX expert, but I think that POSIX doesn't guarantee data 
to be recoverable until flush() or fsync() have been executed).
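
A small sketch of the brick-side revocation timeout described above. 
Time is passed in explicitly to keep the example deterministic; the 
Ownership class and its method names are hypothetical.

```python
class Ownership:
    """Brick-side record of which client owns an inode exclusively."""

    def __init__(self, owner, timeout):
        self.owner = owner        # client holding exclusive access
        self.timeout = timeout    # seconds allowed to acknowledge
        self.notified_at = None   # when the owner was asked to release

    def conflicting_request(self, now):
        # Another client wants access: notify the owner once.
        if self.owner is not None and self.notified_at is None:
            self.notified_at = now

    def acknowledge(self, client):
        # The owner answered in time and gives the access back.
        if client == self.owner:
            self.owner = None
            self.notified_at = None

    def expire(self, now):
        """Revoke ownership if the notification timed out.

        Returns the revoked client, or None if nothing was revoked.
        """
        if self.notified_at is None:
            return None
        if now - self.notified_at < self.timeout:
            return None
        revoked = self.owner
        self.owner = None
        self.notified_at = None
        return revoked
```

After expire() returns a client, the brick would deny any request from 
that client that arrives without access information in the xdata.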

When the client reconnects, it will continue to execute normally. It 
could receive some notification of invalidation of one inode that it 
doesn't have anymore. In this case it will simply acknowledge the 
notification.

When a client disconnects but it does not die:

It's basically the same as the above case, however when the client 
reconnects it will try to recover its previous ownerships. If nothing 
has changed, it will recover them. Otherwise some of the inodes will be 
invalidated. Any pending operations on the invalidated inodes will be 
lost (it's as if the client had died).


On 06/02/14 00:24, Anand Avati wrote:
> Xavi,
> Getting such a caching mechanism has several aspects. First of all we 
> need the framework pieces implemented (particularly server originated 
> messages to the client for invalidation and revokes) in a well 
> designed way. Particularly how we address a specific translator in a 
> message originating from the server. Some of the recent changes to 
> client_t allow server-side translators to get a handle (the 
> client_t object) on which messages can be submitted back to the client.
> Such a framework (of having server originated messages) is also 
> necessary for implementing oplocks (and possibly leases) - 
> particularly interesting for the Samba integration.
> As Jeff already mentioned, this is an area gluster has not 
> focused on, given the targeted use case. However, extending this to 
> internal use cases (to avoid per-operation inodelks) can benefit many 
> modules - encryption/crypt, afr, etc. It seems 
> possible to have a common framework for delegating locks to clients, 
> and build caching coherency protocols / oplocks / inodelk avoidance on 
> top of it.
> Feel free to share a more detailed proposal if you have/plan one - 
> I'm sure the Samba folks (Ira copied) would be interested too.
> Thanks!
> Avati
> On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez 
> <xhernandez at datalab.es> wrote:
>     On 04.02.2014 17:18, Jeff Darcy wrote:
>             The only synchronization point needed is to make sure that
>             all bricks agree on the inode state and which client owns
>             it. This can be achieved without locking using a method
>             similar to what I implemented in the DFC translator.
>             Besides the lock-less architecture, the main advantage is
>             that much more aggressive caching strategies can be
>             implemented very near to the final user, increasing
>             considerably the throughput of the file system. Special
>             care has to be taken with things that can fail on
>             background writes (basically brick space and user access
>             rights). Those should be handled appropriately on the
>             client side to guarantee future success of writes. Of
>             course this is only a high level overview. A deeper
>             analysis should be done to see what to do on each special
>             case.
>             What do you think?
>         I think this is a great idea for where we can go - and need to
>         go - in the long term. However, it's important to recognize
>         that it *is* the long term. We had to solve almost exactly the
>         same problems in MPFS long ago. Whether the synchronization
>         uses locks or not *locally* is meaningless, because all of the
>         difficult problems have to do with recovering the
>         *distributed* state. What happens when a brick fails while
>         holding an inode in any state but I? How do we recognize it,
>         what do we do about it, how do we handle the case where it
>         comes back and needs to re-acquire its previous state? How do
>         we make sure that a brick can successfully flush everything it
>         needs to before it yields a lock/lease/whatever? That's going
>         to require some kind of flow control, which is itself a pretty
>         big project. It's not impossible, but it took multiple people
>         some years for MPFS, and ditto for every other project (e.g.
>         Ceph or XtreemFS) which adopted similar approaches.
>         GlusterFS's historical avoidance of this complexity certainly
>         has some drawbacks, but it has also been key to us making far
>         more progress in other areas.
>     Well, it's true that there will be a lot of tricky cases that will
>     need to be handled to be sure that data integrity and system
>     responsiveness are guaranteed, however I think that they are not
>     more difficult than what can happen currently if a client dies or
>     loses communication while it holds a lock on a file.
>     Anyway I think there is a great potential with this mechanism
>     because it can allow the implementation of powerful caches, even
>     based on SSD, that could improve the performance a lot.
>     Of course there is a lot of work solving all potential failures and
>     designing the right thing. An important consideration is that all
>     these methods try to solve a problem that is seldom found (i.e. having
>     more than one client modifying the same file at the same time). So a
>     solution that has almost 0 overhead for the normal case and allows the
>     implementation of aggressive caching mechanisms seems a big win.
>         To move forward on this, I think we need a *much* more
>         detailed idea of how we're going to handle the nasty cases.
>         Would some sort of online collaboration - e.g. Hangouts - make
>         more sense than continuing via email?
>     Of course, we can talk on IRC or another place if you prefer.
>     Xavi
>     _______________________________________________
>     Gluster-devel mailing list
>     Gluster-devel at nongnu.org
>     https://lists.nongnu.org/mailman/listinfo/gluster-devel
