[Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster

Xavier Hernandez xhernandez at datalab.es
Wed Feb 5 19:27:34 UTC 2014


On 04.02.2014 17:18, Jeff Darcy wrote:

>> The only synchronization point needed is to make sure that all bricks
>> agree on the inode state and which client owns it. This can be achieved
>> without locking, using a method similar to what I implemented in the DFC
>> translator. Besides the lock-less architecture, the main advantage is
>> that much more aggressive caching strategies can be implemented very
>> near to the final user, increasing the throughput of the file system
>> considerably. Special care has to be taken with things that can fail on
>> background writes (basically brick space and user access rights). Those
>> should be handled appropriately on the client side to guarantee that
>> future writes succeed. Of course this is only a high level overview; a
>> deeper analysis should be done to see what to do in each special case.
>> What do you think?
>
> I think this is a great idea for where we can go - and need to go - in
> the long term. However, it's important to recognize that it *is* the
> long term. We had to solve almost exactly the same problems in MPFS long
> ago. Whether the synchronization uses locks or not *locally* is
> meaningless, because all of the difficult problems have to do with
> recovering the *distributed* state. What happens when a brick fails
> while holding an inode in any state but I? How do we recognize it, what
> do we do about it, how do we handle the case where it comes back and
> needs to re-acquire its previous state? How do we make sure that a brick
> can successfully flush everything it needs to before it yields a
> lock/lease/whatever? That's going to require some kind of flow control,
> which is itself a pretty big project. It's not impossible, but it took
> multiple people some years for MPFS, and ditto for every other project
> (e.g. Ceph or XtreemFS) which adopted similar approaches. GlusterFS's
> historical avoidance of this complexity certainly has some drawbacks,
> but it has also been key to us making far more progress in other areas.
>
Well, it's true that there will be a lot of tricky cases that will need
to be handled to make sure that data integrity and system responsiveness
are guaranteed. However, I think they are not more difficult than what
can happen currently if a client dies or loses communication while it
holds a lock on a file.
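
Just to make the idea a bit more concrete, here is a minimal sketch
(invented names and structures, not actual GlusterFS code) of the kind of
per-inode record each brick could keep so that all bricks agree on the
inode state and on which client owns it, and of how a grant held by a
dead client could be reclaimed after a timeout, much like a stale lock
would be reclaimed today:

/*
 * Hypothetical sketch only (invented names, not real GlusterFS code).
 * One record per inode, kept by every brick, so that all bricks agree
 * on the inode state and on the owning client.
 */

#include <stdint.h>
#include <time.h>

/* Sharing states of an inode as seen by a brick. */
typedef enum {
        INODE_IDLE,       /* the "I" state: nobody owns it            */
        INODE_EXCLUSIVE,  /* one client may cache reads and writes    */
        INODE_SHARED      /* several clients may cache reads only     */
} inode_state_t;

/* What every brick remembers about one inode. */
typedef struct {
        inode_state_t state;
        uint64_t      owner_id;      /* client holding the grant       */
        time_t        last_renewed;  /* last time the owner refreshed  */
} inode_grant_t;

#define GRANT_TIMEOUT 30             /* seconds without renewal        */

/* Try to give 'client' exclusive ownership of an inode.  Returns 1 on
 * success, 0 if another client still owns it and looks alive. */
static int
grant_exclusive(inode_grant_t *g, uint64_t client, time_t now)
{
        if (g->state != INODE_IDLE && g->owner_id != client &&
            now - g->last_renewed < GRANT_TIMEOUT)
                return 0;            /* owned by a live client         */

        /* Either idle, already ours, or the owner timed out: take it. */
        g->state        = INODE_EXCLUSIVE;
        g->owner_id     = client;
        g->last_renewed = now;
        return 1;
}

The timeout-based reclaim above is only one possible policy; the real
handling of a failed brick or client would of course need the deeper
analysis you mention.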

Anyway, I think this mechanism has great potential because it allows the
implementation of powerful caches, even SSD-based ones, that could
improve performance a lot.

Of course there is a lot of work in handling all potential failures and
designing it right. An important consideration is that all these methods
try to solve a problem that is seldom found in practice (i.e. having
more than one client modifying the same file at the same time). So a
solution that has almost zero overhead for the normal case and allows
the implementation of aggressive caching mechanisms seems a big win.
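
To illustrate why the normal case costs almost nothing, here is another
invented sketch, this time of the client-side write path (again just an
illustration, with made-up helper names): only the first access to a
file pays a round trip to the bricks; after that the grant is cached and
writes stay local.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder stubs standing in for real brick RPCs and a local cache. */
static bool
ask_bricks_for_exclusive(uint64_t inode)
{
        (void)inode;
        return true;                  /* assume the grant is obtained   */
}

static void
write_through_to_bricks(uint64_t inode, const void *buf, size_t len)
{
        (void)inode; (void)buf; (void)len;
}

static void
write_to_local_cache(const void *buf, size_t len)
{
        (void)buf; (void)len;
}

typedef struct {
        bool exclusive;               /* do we own this inode already?  */
} client_grant_t;

static int
cached_write(client_grant_t *g, uint64_t inode, const void *buf, size_t len)
{
        if (!g->exclusive) {
                /* First access: one round trip.  Free space and access
                 * rights would also be checked here, so that a later
                 * background flush cannot fail for those reasons. */
                if (!ask_bricks_for_exclusive(inode)) {
                        /* Contended file: fall back to write-through. */
                        write_through_to_bricks(inode, buf, len);
                        return 0;
                }
                g->exclusive = true;
        }

        /* Normal case: the grant is cached, the write stays local and
         * generates no network traffic at all. */
        write_to_local_cache(buf, len);
        return 0;
}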

> To move forward on this, I think we need a *much* more detailed idea of
> how we're going to handle the nasty cases. Would some sort of online
> collaboration - e.g. Hangouts - make more sense than continuing via
> email?
>
Of course, we can talk on IRC or anywhere else you prefer.

Xavi



