[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Thu Feb 6 08:51:43 UTC 2014

Hi Avati,

On 06/02/14 00:24, Anand Avati wrote:
> Xavi,
> Getting such a caching mechanism has several aspects. First of all we
> need the framework pieces implemented (particularly server-originated
> messages to the client for invalidation and revokes) in a well-designed
> way, in particular how we address a specific translator in a message
> originating from the server. Some of the recent changes to client_t
> allow server-side translators to get a handle (the client_t object)
> on which messages can be submitted back to the client.
>
> Such a framework (of having server originated messages) is also 
> necessary for implementing oplocks (and possibly leases) - 
> particularly interesting for the Samba integration.
>
Yes, that is a basic requirement for many features. I saw the client_t 
changes but haven't had time to see if they could be used to implement 
the kind of mechanism I proposed. This will need a look.

When I started implementing the DFC translator 
(https://forge.gluster.org/disperse/dfc) I needed something very similar 
but at that time there wasn't any suitable client_t implementation I 
could use. I solved it by using a pool of special getxattr requests that 
the translator on the bricks stores until it needs to send some message 
back to the client. It's not a great solution but it works with the 
available resources at the moment.
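
To make the idea of parked requests concrete, here is a minimal sketch in
Python. It is not the DFC code: the class names (Brick, Client), the methods
(park, notify) and the pool size are all hypothetical, and plain callbacks
stand in for the held getxattr calls. It only shows the general pattern,
under those assumptions: the client keeps a few requests outstanding, and the
brick answers one of them whenever it has something to say.

```python
from collections import defaultdict, deque


class Brick:
    """Hypothetical brick-side translator that parks client requests."""

    def __init__(self):
        # client id -> parked reply callbacks (stand-ins for held getxattr calls)
        self.parked = defaultdict(deque)
        # client id -> messages waiting for a parked request to carry them
        self.pending = defaultdict(deque)

    def park(self, client_id, reply):
        """Hold a special request; answer at once only if a message is waiting."""
        if self.pending[client_id]:
            reply(self.pending[client_id].popleft())
        else:
            self.parked[client_id].append(reply)

    def notify(self, client_id, message):
        """Send a message to the client by completing one parked request."""
        if self.parked[client_id]:
            self.parked[client_id].popleft()(message)
        else:
            # No request available right now: keep the message for later.
            self.pending[client_id].append(message)


class Client:
    """Hypothetical client side: keeps a pool of requests outstanding."""

    def __init__(self, client_id, brick, pool_size=2):
        self.client_id = client_id
        self.brick = brick
        self.received = []
        for _ in range(pool_size):
            self.brick.park(self.client_id, self._on_reply)

    def _on_reply(self, message):
        self.received.append(message)
        # Refill the pool so the brick always has a request to answer.
        self.brick.park(self.client_id, self._on_reply)
```

For example, after `brick = Brick()` and `client = Client("c1", brick)`, a call
to `brick.notify("c1", "invalidate")` lands in `client.received` and the pool
is refilled immediately. The cost of this approach is that every client keeps
a few requests permanently outstanding on every brick.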

> As Jeff already mentioned, this is an area that gluster has not
> focused on, given the targeted use case. However, extending this to
> internal use cases (to avoid per-operation inodelks) can benefit many
> modules: encryption/crypt, afr, etc. It seems possible to have a
> common framework for delegating locks to clients, and to build cache
> coherency protocols / oplocks / inodelk avoidance on top of it.
>
> Feel free to share a more detailed proposal if you have or plan one -
> I'm sure the Samba folks (Ira copied) would be interested too.
I have some ideas on how to implement it and some special cases, but I 
need to work more on it before it can be considered a valid model. I 
just wanted to propose the idea to see if it could be valid or not 
before spending too much of my scarce time working on it. I'll try to 
get a more detailed picture to discuss it.

Best regards,

Xavi

>
> Thanks!
> Avati
>
>
> On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez 
> <xhernandez at datalab.es> wrote:
>
>     On 04.02.2014 17:18, Jeff Darcy wrote:
>
>             The only synchronization point needed is to make sure that
>             all bricks
>             agree on the inode state and which client owns it. This
>             can be achieved
>             without locking using a method similar to what I
>             implemented in the DFC
>             translator. Besides the lock-less architecture, the main
>             advantage is
>             that much more aggressive caching strategies can be
>             implemented very
>             near to the final user, increasing considerably the
>             throughput of the
>             file system. Special care has to be taken with things that
>             can fail on
>             background writes (basically brick space and user access
>             rights). Those
>             should be handled appropriately on the client side to
>             guarantee future
>             success of writes. Of course this is only a high level
>             overview. A
>             deeper analysis should be done to see what to do on each
>             special case.
>             What do you think?
>
>
>         I think this is a great idea for where we can go - and need to
>         go - in the
>         long term. However, it's important to recognize that it *is*
>         the long
>         term. We had to solve almost exactly the same problems in MPFS
>         long ago.
>         Whether the synchronization uses locks or not *locally* is
>         meaningless,
>         because all of the difficult problems have to do with
>         recovering the
>         *distributed* state. What happens when a brick fails while
>         holding an
>         inode in any state but I? How do we recognize it, what do we
>         do about it,
>         how do we handle the case where it comes back and needs to
>         re-acquire its
>         previous state? How do we make sure that a brick can
>         successfully flush
>         everything it needs to before it yields a lock/lease/whatever?
>         That's
>         going to require some kind of flow control, which is itself a
>         pretty big
>         project. It's not impossible, but it took multiple people some
>         years for
>         MPFS, and ditto for every other project (e.g. Ceph or
>         XtreemFS) which
>         adopted similar approaches. GlusterFS's historical avoidance
>         of this
>         complexity certainly has some drawbacks, but it has also been
>         key to us
>         making far more progress in other areas.
>
>     Well, it's true that there will be a lot of tricky cases that will
>     need to be handled to be sure that data integrity and system
>     responsiveness are guaranteed. However, I think that they are no more
>     difficult than what can happen currently if a client dies or loses
>     communication while it holds a lock on a file.
>
>     Anyway, I think there is great potential in this mechanism because it
>     can allow the implementation of powerful caches, even SSD-based ones,
>     that could improve performance a lot.
>
>     Of course there is a lot of work solving all potential failures and
>     designing the right thing. An important consideration is that all
>     these methods try to solve a problem that is seldom found (i.e. having
>     more than one client modifying the same file at the same time). So a
>     solution that has almost 0 overhead for the normal case and allows the
>     implementation of aggressive caching mechanisms seems a big win.
>
>
>         To move forward on this, I think we need a *much* more
>         detailed idea of
>         how we're going to handle the nasty cases. Would some sort of
>         online
>         collaboration - e.g. Hangouts - make more sense than
>         continuing via
>         email?
>
>     Of course, we can talk on IRC or somewhere else if you prefer.
>
>     Xavi
>
>