[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Xavier Hernandez xhernandez at datalab.es
Thu Feb 6 08:54:03 UTC 2014

On 06/02/14 08:51, Vijay Bellur wrote:
> On 02/06/2014 05:46 AM, Ira Cooper wrote:
>> Yep... this is an area I am very interested in, going forward.
>> Especially sending messages back: we'll need that for any
>> caching/leasing/oplock/whatever-we-call-it type of protocol.
> +1. I am very interested in seeing this implemented.
> Xavi: Would you be able to attend next week's IRC meeting so that we 
> can discuss this further? Of course, we can have out of band 
> conversations but the IRC meeting might be a common ground for all 
> interested folks to get together.
Of course, I'll be there.


> -Vijay
>> Keep me in the loop, and I'll keep tracking the list.  (I'm already on
>> the list.)
>> I'm also ira on freenode if you want to find me.
>> Thanks,
>> -Ira / ira@(samba.org|redhat.com)
>> On Wed, Feb 5, 2014 at 6:24 PM, Anand Avati <avati at gluster.org> wrote:
>>     Xavi,
>>     Getting such a caching mechanism has several aspects. First of all
>>     we need the framework pieces implemented (particularly server
>>     originated messages to the client for invalidation and revokes) in a
>>     well designed way, in particular how we address a specific translator
>>     in a message originating from the server. Some of the recent changes
>>     to client_t allow server-side translators to get a handle (the
>>     client_t object) on which messages can be submitted back to the
>>     client.
>>     Such a framework (of having server originated messages) is also
>>     necessary for implementing oplocks (and possibly leases) -
>>     particularly interesting for the Samba integration.
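Avati's question of how a server-originated message addresses a specific translator can be illustrated with a small client-side routing table. This is a minimal sketch, not GlusterFS code: the message layout, the numeric translator ids and every name below (`upcall_msg`, `router_register`, `router_dispatch`, `cache_on_upcall`) are hypothetical, and it assumes client and server have agreed on translator ids in advance. Only the client-side dispatch is shown; the server-side submission through the client_t handle is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message types a server could push to a client. */
enum upcall_type { UPCALL_INVALIDATE = 1, UPCALL_REVOKE = 2 };

/* Hypothetical wire format. The target translator is named by a numeric
 * id that client and server are assumed to have agreed on beforehand. */
struct upcall_msg {
    uint32_t xlator_id; /* client-side translator that must handle it */
    uint32_t type;      /* one of enum upcall_type */
    uint64_t inode_id;  /* inode the message refers to */
};

typedef int (*upcall_handler_t)(const struct upcall_msg *msg, void *priv);

#define MAX_XLATORS 16

/* Client-side routing table: at most one handler per translator id. */
struct upcall_router {
    upcall_handler_t handlers[MAX_XLATORS];
    void *priv[MAX_XLATORS];
};

static void router_init(struct upcall_router *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < MAX_XLATORS; i++) {
        r->handlers[i] = NULL;
        r->priv[i] = NULL;
    }
}

/* Returns 0 on success, -1 if the id is out of range or already taken. */
static int router_register(struct upcall_router *r, uint32_t id,
                           upcall_handler_t fn, void *priv)
{
    if (id >= MAX_XLATORS || r->handlers[id] != NULL)
        return -1;
    r->handlers[id] = fn;
    r->priv[id] = priv;
    return 0;
}

/* Delivers a server-originated message to the translator it addresses.
 * Returns -1 if no translator with that id is registered, otherwise the
 * handler's own return value. */
static int router_dispatch(const struct upcall_router *r,
                           const struct upcall_msg *msg)
{
    if (msg->xlator_id >= MAX_XLATORS || r->handlers[msg->xlator_id] == NULL)
        return -1;
    return r->handlers[msg->xlator_id](msg, r->priv[msg->xlator_id]);
}

/* Example handler for a caching translator: counts invalidations. */
static int cache_on_upcall(const struct upcall_msg *msg, void *priv)
{
    unsigned *invalidations = priv;
    if (msg->type == UPCALL_INVALIDATE)
        (*invalidations)++;
    return 0;
}
```

A numeric id keeps each message small; a design could instead name the target by its translator name from the volume graph, at the cost of a string lookup per message.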
>>     As Jeff already mentioned, this is an area gluster has not focused
>>     on, given the targeted use case. However, there are benefits in
>>     extending this to internal use cases: avoiding per-operation
>>     inodelks can benefit many modules (encryption/crypt, afr, etc.). It
>>     seems possible to have a common framework for delegating locks to
>>     clients, and build cache coherency protocols / oplocks / inodelk
>>     avoidance on top of it.
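The common framework for delegating locks to clients that Avati describes can be pictured as a per-inode record on the server. The sketch below is hypothetical: the names, the integer client ids and the single waiter slot are assumptions. It only shows the grant/recall/release cycle that oplocks, caching or inodelk avoidance could share, not how GlusterFS implements any of them.

```c
#include <assert.h>
#include <stdint.h>

#define NO_CLIENT 0u

enum deleg_result { DELEG_GRANTED, DELEG_RECALLING };

/* Server-side delegation state for one inode (hypothetical). */
struct deleg {
    uint32_t owner;   /* client holding the delegation, or NO_CLIENT */
    uint32_t waiter;  /* client waiting after a recall, or NO_CLIENT */
    int recall_sent;  /* 1 while a revoke to the owner is outstanding */
};

static void deleg_init(struct deleg *d)
{
    d->owner = NO_CLIENT;
    d->waiter = NO_CLIENT;
    d->recall_sent = 0;
}

/* A client asks to lock/cache the inode. With no conflict this is
 * granted immediately; otherwise the current owner must be recalled
 * (in a real system, via a server-originated revoke message). Only one
 * waiter slot is kept; other contenders are expected to retry later. */
static enum deleg_result deleg_acquire(struct deleg *d, uint32_t client)
{
    assert(client != NO_CLIENT);
    if (d->owner == NO_CLIENT || d->owner == client) {
        d->owner = client;
        return DELEG_GRANTED;
    }
    if (d->waiter == NO_CLIENT)
        d->waiter = client;
    d->recall_sent = 1;
    return DELEG_RECALLING;
}

/* The owner returns the delegation after flushing its cached state.
 * It passes directly to the waiter, if any. Returns the new owner. */
static uint32_t deleg_release(struct deleg *d, uint32_t client)
{
    if (d->owner != client)
        return d->owner;          /* stale or bogus release: ignore */
    d->owner = d->waiter;
    d->waiter = NO_CLIENT;
    d->recall_sent = 0;
    return d->owner;
}
```

The uncontended path never leaves the grant branch, which is where per-operation inodelk traffic would be saved; the recall branch is where server-originated messages become necessary.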
>>     Feel free to share a more detailed proposal if you have or plan one;
>>     I'm sure the Samba folks (Ira copied) would be interested too.
>>     Thanks!
>>     Avati
>>     On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez
>>     <xhernandez at datalab.es> wrote:
>>         On 04.02.2014 17:18, Jeff Darcy wrote:
>>                 The only synchronization point needed is to make sure
>>                 that all bricks agree on the inode state and on which
>>                 client owns it. This can be achieved without locking,
>>                 using a method similar to what I implemented in the
>>                 DFC translator. Besides the lock-less architecture, the
>>                 main advantage is that much more aggressive caching
>>                 strategies can be implemented very close to the end
>>                 user, considerably increasing the throughput of the
>>                 file system. Special care has to be taken with things
>>                 that can fail on background writes (basically brick
>>                 space and user access rights). Those should be handled
>>                 appropriately on the client side to guarantee that
>>                 future writes succeed. Of course this is only a
>>                 high-level overview. A deeper analysis should be done
>>                 to see what to do in each special case.
>>                 What do you think?
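The DFC method itself is not described in this thread, so the following is only a generic sketch of what lock-less agreement on inode ownership could look like, not DFC's algorithm: each brick applies a conditional (compare-and-set style) update on a version number, and the client owns the inode only when every brick accepted. The struct layout and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One brick's view of an inode (hypothetical layout). */
struct brick_inode {
    uint64_t version; /* bumped on every ownership change */
    uint32_t owner;   /* client that owns the inode, 0 = nobody */
};

/* Brick side: a conditional update that each brick applies on its own,
 * without any cross-brick lock. It succeeds only if the caller based its
 * request on the brick's current version. */
static int brick_try_claim(struct brick_inode *b, uint64_t seen_version,
                           uint32_t client)
{
    if (b->version != seen_version)
        return 0;
    b->version++;
    b->owner = client;
    return 1;
}

/* Client side: the claim is valid only if every brick accepted it, so all
 * bricks now agree on the same (version, owner) pair. A partial success
 * (some bricks accepted, some did not) is reported as failure here; a real
 * protocol would need a rule to resolve it. */
static int claim_everywhere(struct brick_inode *bricks, size_t n,
                            uint64_t seen_version, uint32_t client)
{
    size_t accepted = 0;
    for (size_t i = 0; i < n; i++)
        accepted += (size_t)brick_try_claim(&bricks[i], seen_version, client);
    return accepted == n;
}

/* True if all bricks report the same owner and version. */
static int bricks_agree(const struct brick_inode *bricks, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (bricks[i].version != bricks[0].version ||
            bricks[i].owner != bricks[0].owner)
            return 0;
    return 1;
}
```

If one brick rejects (for example because its version differs), the others have already switched owner and the bricks no longer agree. Recovering from that partially applied state is exactly the kind of distributed case the rest of the thread is concerned with.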
>>             I think this is a great idea for where we can go - and need
>>             to go - in the long term. However, it's important to
>>             recognize that it *is* the long term. We had to solve
>>             almost exactly the same problems in MPFS long ago.
>>             Whether the synchronization uses locks or not *locally* is
>>             meaningless, because all of the difficult problems have to
>>             do with recovering the *distributed* state. What happens
>>             when a brick fails while holding an inode in any state but
>>             I? How do we recognize it, what do we do about it, how do
>>             we handle the case where it comes back and needs to
>>             re-acquire its previous state? How do we make sure that a
>>             brick can successfully flush everything it needs to before
>>             it yields a lock/lease/whatever? That's going to require
>>             some kind of flow control, which is itself a pretty big
>>             project. It's not impossible, but it took multiple people
>>             some years for MPFS, and ditto for every other project
>>             (e.g. Ceph or XtreemFS) which adopted similar approaches.
>>             GlusterFS's historical avoidance of this complexity
>>             certainly has some drawbacks, but it has also been key to
>>             us making far more progress in other areas.
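The questions above presuppose an ownership state machine and a flush-before-yield rule. Assuming MESI-style states (the thread only names I; S, E and M, and everything below, are an illustration rather than a GlusterFS design), a client-side sketch might look like this. The hypothetical `max_dirty` field stands in for the flow control mentioned above: capping pending data bounds how much a client must flush when recalled. How to choose the cap, and what happens when the holder fails, is deliberately left open.

```c
#include <assert.h>
#include <stddef.h>

/* MESI-style ownership states; the thread only names "I", the rest are
 * an assumption. */
enum cstate { ST_I, ST_S, ST_E, ST_M };

/* Client-side cache state for one inode (hypothetical). */
struct cached_inode {
    enum cstate state;
    size_t dirty_bytes; /* written locally, not yet on the bricks */
    size_t max_dirty;   /* flow-control cap on pending data */
};

/* Local write: requires exclusive ownership (E or M) and must not push
 * the pending data above the cap. Returns 0 on success, -1 if ownership
 * must be acquired first, -2 if the caller must flush first. */
static int local_write(struct cached_inode *c, size_t len)
{
    if (c->state != ST_E && c->state != ST_M)
        return -1;
    if (len > c->max_dirty - c->dirty_bytes)
        return -2;
    c->dirty_bytes += len;
    c->state = ST_M;
    return 0;
}

/* Placeholder for sending dirty data to the bricks; here it simply
 * assumes the flush succeeded. */
static void flush_to_bricks(struct cached_inode *c)
{
    c->dirty_bytes = 0;
}

/* Server recall. If another client wants to write, we drop to I; if it
 * only wants to read, we may keep a shared copy. Dirty data is always
 * flushed before the state is downgraded. */
static void on_recall(struct cached_inode *c, int other_wants_write)
{
    if (c->state == ST_M) {
        flush_to_bricks(c);
        c->state = ST_E;
    }
    assert(c->dirty_bytes == 0); /* never yield with unflushed data */
    if (other_wants_write)
        c->state = ST_I;
    else if (c->state != ST_I)
        c->state = ST_S;
}
```

The sketch covers only the well-behaved path; the failure cases raised above (a holder that disappears in M, or returns and tries to reclaim its old state) are precisely what it does not handle.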
>>         Well, it's true that there will be a lot of tricky cases that
>>         will need to be handled to be sure that data integrity and
>>         system responsiveness are guaranteed. However, I think they are
>>         not more difficult than what can happen currently if a client
>>         dies or loses communication while it holds a lock on a file.
>>         Anyway, I think there is great potential in this mechanism
>>         because it can allow the implementation of powerful caches,
>>         even SSD-based ones, that could improve performance a lot.
>>         Of course there is a lot of work in solving all potential
>>         failures and designing the right thing. An important
>>         consideration is that all these methods try to solve a problem
>>         that is seldom found (i.e. having more than one client
>>         modifying the same file at the same time). So a solution that
>>         has almost zero overhead for the normal case and allows the
>>         implementation of aggressive caching mechanisms seems a big win.
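The "almost zero overhead for the normal case" argument can be made concrete with a simple message count. The cost model here is an assumption for illustration only: per-operation locking costs a lock and an unlock message per operation, while a delegation costs one acquire, plus a recall and a release whenever ownership moves to another client. The function names are hypothetical and client ids are assumed to be non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cost of per-operation locking: lock + unlock per operation. */
static size_t msgs_per_op_locking(size_t nops)
{
    return 2 * nops;
}

/* Assumed cost with a delegation, for a trace of client ids (one entry
 * per operation, ids must be non-zero). */
static size_t msgs_with_delegation(const uint32_t *trace, size_t nops)
{
    assert(trace != NULL || nops == 0);
    size_t msgs = 0;
    uint32_t owner = 0;            /* 0 = nobody holds the delegation */
    for (size_t i = 0; i < nops; i++) {
        assert(trace[i] != 0);
        if (trace[i] == owner)
            continue;              /* fast path: served locally */
        if (owner != 0)
            msgs += 2;             /* recall to old owner + its release */
        msgs += 1;                 /* acquire by the new client */
        owner = trace[i];
    }
    return msgs;
}
```

Under this model, 1000 operations by a single client cost 1 message with a delegation versus 2000 with per-operation locking, while a trace alternating between two clients (1, 2, 1, 2) costs 10 versus 8. The scheme wins exactly when concurrent modification of the same file is rare.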
>>             To move forward on this, I think we need a *much* more
>>             detailed idea of how we're going to handle the nasty cases.
>>             Would some sort of online collaboration - e.g. Hangouts -
>>             make more sense than continuing via email?
>>         Of course, we can talk on IRC or another place if you prefer.
>>         Xavi
>>         _________________________________________________
>>         Gluster-devel mailing list
>>         Gluster-devel at nongnu.org
>>         https://lists.nongnu.org/mailman/listinfo/gluster-devel
