[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Xavier Hernandez
xhernandez at datalab.es
Thu Feb 6 08:51:43 UTC 2014
Hi Avati,
On 06/02/14 00:24, Anand Avati wrote:
> Xavi,
> Getting such a caching mechanism has several aspects. First of all we
> need the framework pieces implemented (particularly server-originated
> messages to the client for invalidation and revokes) in a well-designed
> way, in particular how we address a specific translator in a message
> originating from the server. Some of the recent changes to client_t
> allow server-side translators to get a handle (the client_t object) on
> which messages can be submitted back to the client.
>
> Such a framework (of having server-originated messages) is also
> necessary for implementing oplocks (and possibly leases) -
> particularly interesting for the Samba integration.
>
Yes, that is a basic requirement for many features. I saw the client_t
changes but haven't had time to see if they could be used to implement
the kind of mechanism I proposed. This will need a look.
When I started implementing the DFC translator
(https://forge.gluster.org/disperse/dfc) I needed something very similar,
but at that time there wasn't any suitable client_t implementation I
could use. I solved it with a pool of special getxattr requests that the
translator on the bricks holds until it needs to send some message back
to the client. It's not a great solution, but it works with the
resources available at the moment.
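
To make the idea a bit more concrete, here is a minimal, standalone C
sketch of that request pool. It is not the actual DFC code; all names
and structures below are invented just for the illustration:

#include <stdio.h>
#include <stdlib.h>

/*
 * Standalone illustration of the "parked request" idea: the client keeps
 * a pool of outstanding requests on the brick, and the brick answers one
 * of them only when it has a notification to deliver.  This is NOT the
 * actual DFC code; all names and structures here are invented.
 */

typedef struct parked_req {
    int                id;      /* identifies the client-side call        */
    struct parked_req *next;
} parked_req_t;

typedef struct {
    parked_req_t *head;         /* requests currently parked on the brick */
} notify_pool_t;

/* The client sends a special request; the brick parks it, replying later. */
static void pool_park(notify_pool_t *pool, int req_id)
{
    parked_req_t *req = calloc(1, sizeof(*req));

    req->id = req_id;
    req->next = pool->head;
    pool->head = req;
    printf("parked request %d (no reply sent yet)\n", req_id);
}

/*
 * The brick needs to notify the client (e.g. invalidate a cached inode):
 * it takes one parked request and "replies" to it, carrying the message.
 * Returns 0 on success, -1 if the pool is empty and the event must wait.
 */
static int pool_notify(notify_pool_t *pool, const char *message)
{
    parked_req_t *req = pool->head;

    if (req == NULL)
        return -1;
    pool->head = req->next;
    printf("replying to request %d with: %s\n", req->id, message);
    free(req);
    return 0;
}

int main(void)
{
    notify_pool_t pool = { .head = NULL };

    /* The client keeps a couple of requests parked at all times. */
    pool_park(&pool, 1);
    pool_park(&pool, 2);

    /* Later, the brick-side translator pushes messages back. */
    pool_notify(&pool, "invalidate cached inode");
    pool_notify(&pool, "revoke write ownership");

    if (pool_notify(&pool, "another event") < 0)
        printf("pool empty: client must park a new request first\n");

    return 0;
}

The real translator obviously has to deal with locking, reconnections
and a bounded pool size, but the flow is the same: the client always
keeps a few requests parked on the brick, and the brick only replies to
one of them when it has something to say.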
> As Jeff already mentioned, this is an area gluster has not focused on,
> given the targeted use case. However, extending this to internal use
> cases (avoiding per-operation inodelks) can benefit many modules -
> encryption/crypt, afr, etc. It seems possible to have a common
> framework for delegating locks to clients, and to build caching
> coherency protocols / oplocks / inodelk avoidance on top of it.
>
> Feel free to share a more detailed proposal if you have a plan -
> I'm sure the Samba folks (Ira copied) would be interested too.
I have some ideas on how to implement it and how to handle some special
cases, but I need to work more on them before it can be considered a
valid model. I just wanted to propose the idea to see whether it is
worth pursuing before spending too much of my scarce time on it. I'll
try to put together a more detailed picture to discuss.
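
Just to illustrate the kind of per-inode bookkeeping I'm thinking of,
here is a very rough standalone sketch. The state names are only an
assumption, loosely borrowed from MESI-style cache protocols (Jeff's
mail below mentions an 'I' state); none of this is existing GlusterFS
code:

#include <stdio.h>

/*
 * Rough, standalone sketch of per-inode ownership state as the bricks
 * could track it.  The state names are only an assumption for this
 * example (loosely MESI-like); they are not existing GlusterFS code.
 */

typedef enum {
    INODE_I,        /* Invalid: no client caches the inode                */
    INODE_S,        /* Shared: one or more clients may cache reads        */
    INODE_E         /* Exclusive: one client owns it and may cache writes */
} inode_state_t;

typedef struct {
    inode_state_t state;
    int           owner;        /* client id holding E, or -1             */
} inode_ctx_t;

/*
 * A client asks for exclusive ownership.  If other clients hold the
 * inode, the brick must first send them a server-originated revoke
 * message (the framework piece discussed above) before granting it.
 * Returns 0 if granted, -1 if the caller must wait for the revoke.
 */
static int grant_exclusive(inode_ctx_t *ctx, int client)
{
    switch (ctx->state) {
    case INODE_I:
        ctx->state = INODE_E;
        ctx->owner = client;
        return 0;                       /* granted immediately            */
    case INODE_E:
        if (ctx->owner == client)
            return 0;                   /* already the owner              */
        /* fall through */
    case INODE_S:
        printf("revoke needed: notify current holder(s), wait for flush\n");
        return -1;
    }
    return -1;
}

int main(void)
{
    inode_ctx_t ctx = { .state = INODE_I, .owner = -1 };

    if (grant_exclusive(&ctx, 1) == 0)
        printf("client 1 now owns the inode exclusively\n");

    if (grant_exclusive(&ctx, 2) != 0)
        printf("client 2 must wait until client 1 has been revoked\n");

    return 0;
}

The hard part, as Jeff points out below, is not this local bookkeeping
but recovering the distributed state when a brick or a client disappears
while an inode is owned; that is where most of the design work will have
to go.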
Best regards,
Xavi
>
> Thanks!
> Avati
>
>
> On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>
> On 04.02.2014 17:18, Jeff Darcy wrote:
>
>        The only synchronization point needed is to make sure that all
>        bricks agree on the inode state and which client owns it. This
>        can be achieved without locking, using a method similar to what
>        I implemented in the DFC translator. Besides the lock-less
>        architecture, the main advantage is that much more aggressive
>        caching strategies can be implemented very near to the final
>        user, increasing the throughput of the file system considerably.
>        Special care has to be taken with things that can fail on
>        background writes (basically brick space and user access
>        rights). Those should be handled appropriately on the client
>        side to guarantee the future success of writes. Of course this
>        is only a high-level overview. A deeper analysis should be done
>        to see what to do in each special case.
>
>        What do you think ?
>
>
>    I think this is a great idea for where we can go - and need to go -
>    in the long term. However, it's important to recognize that it *is*
>    the long term. We had to solve almost exactly the same problems in
>    MPFS long ago. Whether the synchronization uses locks or not
>    *locally* is meaningless, because all of the difficult problems have
>    to do with recovering the *distributed* state. What happens when a
>    brick fails while holding an inode in any state but I? How do we
>    recognize it, what do we do about it, how do we handle the case
>    where it comes back and needs to re-acquire its previous state? How
>    do we make sure that a brick can successfully flush everything it
>    needs to before it yields a lock/lease/whatever? That's going to
>    require some kind of flow control, which is itself a pretty big
>    project. It's not impossible, but it took multiple people some years
>    for MPFS, and ditto for every other project (e.g. Ceph or XtreemFS)
>    which adopted similar approaches. GlusterFS's historical avoidance
>    of this complexity certainly has some drawbacks, but it has also
>    been key to us making far more progress in other areas.
>
> Well, it's true that there will be a lot of tricky cases to handle to
> guarantee data integrity and system responsiveness. However, I think
> they are not more difficult than what can happen today if a client
> dies or loses communication while it holds a lock on a file.
>
> Anyway, I think this mechanism has great potential because it allows
> the implementation of powerful caches, even SSD-based ones, that could
> improve performance a lot.
>
> Of course there is a lot of work in handling all potential failures
> and designing the right thing. An important consideration is that all
> these methods try to solve a problem that is seldom found in practice
> (i.e. more than one client modifying the same file at the same time).
> So a solution that has almost zero overhead in the normal case and
> allows aggressive caching mechanisms seems a big win.
>
>
>    To move forward on this, I think we need a *much* more detailed idea
>    of how we're going to handle the nasty cases. Would some sort of
>    online collaboration - e.g. Hangouts - make more sense than
>    continuing via email?
>
> Of course, we can talk on IRC or anywhere else if you prefer.
>
> Xavi
>