[Gluster-devel] Can I bring a development idea to Dev's attention?

Fri Sep 24 10:07:20 UTC 2010

  On 24/09/2010 08:25, Shehjar Tikoo wrote:
> Thanks, we have similar locking improvements in mind but cannot 
> promise a date when these will be available. Some of the challenges 
> that we'll need to think about are how to map any such locking scheme 
> to standard locking behaviour for posix/nfsv3/v4/cifs.

Hi, thanks for replying

Whilst I can see that there is some optimisation to be had by combining 
brick level locking with filesystem level locking, I just want to 
clarify that my proposal was really about locking between bricks, and 
not about the top-level filesystem locking.

Just to clarify (and apologies if I'm trying to teach filesystem experts 
really obvious stuff...)

- The goal of any fileserver is to take async requests from lots of 
clients and arrange to serialise that access
- Up until recently such fileservers have been on a single machine, but 
involving multiple async clients connecting
- Even in a single server solution the bottleneck becomes that each 
client cannot cache *any* data since it's not known if the server copy 
has changed since we accessed it (even a microsecond earlier)
- The solution which has become popular (see CIFS, NFSV4 (?), GFS2, etc) 
was to offer clients an "optimistic lock", ie the client can acquire a 
token which while it's held means that it can cache data locked by that 
token and even offer writeback optimisations on that data (obviously 
subject to whatever the application tolerates for unsync'ed data)
- This "optimistic lock" means that we effectively push the file locking 
to the client.  Once a lock is acquired, further access by the client is 
no longer bounded by network latency, and under many circumstances this 
leads to massive speedups
- Clearly when a second client comes along and demands access to the 
same data then we need a process to break the lock and inform the first 
client that they need to reacquire the lock (or revert to a kind of 
"write-through" access system while waiting)
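
As a rough sketch of the token idea in the list above (toy Python, not 
Gluster code; FileServer, Client, acquire and on_break are all names I 
have made up for illustration, and the "network" is just a method call):

```python
class FileServer:
    """Toy server handing out one caching token per file (hypothetical)."""

    def __init__(self):
        self.data = {}     # file name -> authoritative contents
        self.holder = {}   # file name -> client currently holding the token

    def acquire(self, client, name):
        current = self.holder.get(name)
        if current is not None and current is not client:
            current.on_break(name)   # break the lock: holder flushes its cache
        self.holder[name] = client
        return self.data.get(name, "")


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # valid only while we hold the token
        self.round_trips = 0   # how often we had to talk to the server

    def _ensure_token(self, name):
        if name not in self.cache:
            self.round_trips += 1
            self.cache[name] = self.server.acquire(self, name)

    def read(self, name):
        self._ensure_token(name)
        return self.cache[name]

    def write(self, name, value):
        self._ensure_token(name)
        self.cache[name] = value   # write-back: flushed when the token is broken

    def on_break(self, name):
        self.server.data[name] = self.cache.pop(name)


server = FileServer()
a, b = Client(server), Client(server)
a.write("f", "v1")
for _ in range(1000):
    a.read("f")
print(a.round_trips)   # 1: only the first access touched the server
print(b.read("f"))     # v1: a's token was broken and its data flushed first
```

Here the second client simply takes the token away; the "write-through" 
fallback mentioned above is not modelled.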

So this process clearly benefits situations where access is serialised, 
with a single client at a time working on a file.  Excluding databases, 
however, this access pattern seems quite common for lots of applications

So with regards to Gluster I would see that we need this same type of 
locking implemented at the brick level.  If you re-read the description 
above, each *gluster server* now plays the role of the client (think of 
the lower level being bricks talking to each other, and the upper level 
being clients talking to bricks).  ie yes, posix locking needs to 
serialise access for every end client that connects to every brick, but 
we can also benefit from locking to serialise access between bricks.  
If 3,000 clients hammer one brick for a single file, then we care that 
our single brick is allowed to read/write that file freely because it 
has informed the other bricks that it now holds a lock; serialising all 
the clients talking to that one brick is a separate problem.

So compared with traditional fileservers we actually need two levels of 
locking to serialise access.  At one level we need to serialise client 
access to the filesystem, and lower down we need to serialise access 
between bricks

I think an alternative way of looking at (and perhaps implementing) the 
situation could be something like:

- Consider two bricks with files replicated between them
- Client 1 accesses Brick 1 and requests File A
- Brick 1 contacts the other replicas and requests to become the "master 
replica" of that file. All future accesses to that file must now go 
through only Brick 1 while it remains in that "role"
- If Client 2 accesses Brick 1 and tries to do something with File A, 
then the normal filesystem locking must arrange for serialisation 
between Client 1 and Client 2, however, Brick 1 need not contact any 
other brick and there is no network latency penalty serving that file to 
Client 2 (obviously at some point one client will write data and we need 
to sync that, but read access incurs no network access)

- OK, now the trick is what happens when Client 3 accesses Brick 2 and 
requests File A...  Somehow we need to wrest control back from Brick 1 
and inform it that it's no longer the "master".  A really simple 
solution to this (at least conceptually) is to proxy all access requests 
from Brick 2 back to Brick 1.  This satisfies our requirement that 
accesses are serialised across bricks and effectively there is still a 
"master" brick remaining in control.
- We can see that this setup is conceptually similar to having a 
traditional lock server arbitrating brick access to a given file, but in 
the example above we have implemented a distributed lock server.  The 
lock server is effectively the same machine as what we hope is the "hot 
server", so we aren't incurring network latency to contact the lock 
server all the time.
- A further improvement would clearly be to have some kind of process 
where the "master brick" can move about, ie in the case above if Client 
3 starts to bash away at Brick 2 for File A, then the "master" role 
migrates to Brick 2, which now holds the lock.  Any access through 
Brick 1 must then either proxy requests to Brick 2 or re-acquire its 
lock (ie become the master again)
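
To make the master/proxy/migrate steps above a bit more concrete, here 
is a toy sketch (Python, not Gluster code).  The Brick class, the 
MIGRATE_AFTER threshold and the hop counting are my own assumptions; it 
also assumes the role announcement is instant and reliable, and it 
ignores syncing writes to the other replica:

```python
class Brick:
    MIGRATE_AFTER = 3  # hypothetical: proxied requests tolerated before taking over

    def __init__(self, name):
        self.name = name
        self.peers = []      # the other replicas
        self.master = {}     # file -> brick this brick believes is master
        self.contents = {}   # file -> data, authoritative only on the master
        self.proxied = {}    # file -> requests proxied since the last handover
        self.hops = 0        # network round trips initiated by this brick

    def _become_master(self, f):
        old = self.master.get(f)
        if old is not None and old is not self:
            self.contents[f] = old.contents.get(f, "")  # pull the current copy
            self.hops += 1
        for brick in [self] + self.peers:   # announce the new role to every replica
            brick.master[f] = self
            brick.proxied[f] = 0
        self.hops += len(self.peers)

    def _owner(self, f):
        """Return the brick that serves this request, taking over if warranted."""
        m = self.master.get(f)
        if m is None:
            self._become_master(f)
            return self
        if m is self:
            return self                     # we are master: no network access
        self.proxied[f] = self.proxied.get(f, 0) + 1
        if self.proxied[f] > self.MIGRATE_AFTER:
            self._become_master(f)          # we are now the "hot" brick
            return self
        self.hops += 1                      # proxy the request to the master
        return m

    def read(self, f):
        return self._owner(f).contents.get(f, "")

    def write(self, f, data):
        self._owner(f).contents[f] = data


b1, b2 = Brick("brick1"), Brick("brick2")
b1.peers, b2.peers = [b2], [b1]
b1.write("A", "hello")
for _ in range(100):
    b1.read("A")
print(b1.hops)                        # 1: only the initial announcement
for _ in range(4):
    b2.read("A")
print(b2.master["A"].name, b2.hops)   # brick2 5: three proxied reads, then takeover
```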

OK, so the above is a very simple example of optimistic locking and 
could be trivially implemented using an external lock server which 
tracks which brick currently holds the lock for a given file (ie can 
read/write freely without first checking if other bricks have modified 
the file).  A brick which doesn't hold the lock on a file must first 
contact the lock server to see whether another brick holds it (roughly 
what it does already).  If not, it can acquire the lock itself.  If the 
lock is held elsewhere, we need to either break the lock or proxy 
access requests to the brick holding it.
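
Something like the following is all I have in mind for the external 
lock server (a minimal Python sketch; LockServer, access and the string 
results are invented for illustration, and lock breaking, timeouts and 
failures are deliberately left out):

```python
class LockServer:
    """Tracks which brick currently holds the lock for each file."""

    def __init__(self):
        self.holders = {}   # file -> name of the brick holding the lock

    def acquire(self, file, brick):
        """Grant the lock if it is free; return whoever holds it afterwards."""
        return self.holders.setdefault(file, brick)

    def release(self, file, brick):
        if self.holders.get(file) == brick:
            del self.holders[file]


def access(server, file, brick, held):
    """Decide how `brick` serves a request; `held` is its set of locked files."""
    if file in held:
        return "local"                 # lock already held: no network access
    holder = server.acquire(file, brick)
    if holder == brick:
        held.add(file)
        return "acquired"              # one round trip, then sticky
    return f"proxy to {holder}"        # or ask the holder to give the lock up


server = LockServer()
held1, held2 = set(), set()
print(access(server, "A", "brick1", held1))   # acquired
print(access(server, "A", "brick1", held1))   # local
print(access(server, "A", "brick2", held2))   # proxy to brick1
```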

Really this is not so different to what is there today, but it's simply 
an efficiency improvement because we don't need to touch *every* brick 
for *every* file access, instead we make some network requests on first 
access to a file and then can continue to touch that file for a period 
afterwards without needing further network access with other bricks

However, whilst some kind of implementation of the above could offer a 
huge performance speedup for many of the situations which come up on the 
mailing list, the issue is that the lock server becomes a) a bottleneck 
and b) a single point of failure.  So the chain of thought almost 
certainly goes something like:

- Make the gluster bricks become the lock servers, ie they negotiate 
amongst themselves. Really this is roughly what happens right now, only 
it's on every access, rather than access being "sticky" once acquired
- Now analyse all the corner cases where bricks go down holding locks, 
or get segmented while holding/acquiring locks, and discover some tricky 
issues...

Paxos seems like a clever way of dealing with the locking going 
distributed, yet not necessarily having a 100% consistent view of who 
owns which lock.  By introducing a voting method it can show robustness 
in the face of failed machines and new machines can be added without 
needing to store reliable state information (or at least this is true 
with the improvements described in the articles)
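
For anyone who hasn't looked at it, the voting idea can be sketched as 
a basic single-decree Paxos round deciding which brick owns one lock.  
This is toy Python under my own assumptions (the Acceptor class and 
propose function are made-up names, messages are plain method calls, 
and there is no lock release or expiry); it is not the improved variant 
from the articles:

```python
class Acceptor:
    """One brick taking part in the vote."""

    def __init__(self):
        self.promised = -1     # highest ballot number promised so far
        self.accepted = None   # (ballot, value) last accepted, if any
        self.up = True         # set to False to simulate a failed brick

    def prepare(self, ballot):
        if not self.up or ballot <= self.promised:
            return None
        self.promised = ballot
        return ("promise", self.accepted)

    def accept(self, ballot, value):
        if not self.up or ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True


def propose(acceptors, ballot, value):
    """Try to get `value` chosen; return the chosen value, or None if no majority."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    prior = [p[1] for p in promises if p[1] is not None]
    if prior:
        # A value may already be chosen: adopt the one with the highest ballot.
        value = max(prior, key=lambda bv: bv[0])[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None


bricks = [Acceptor() for _ in range(3)]
print(propose(bricks, 1, "brick1"))   # brick1: it now owns the lock
bricks[0].up = False                  # one brick fails
print(propose(bricks, 2, "brick2"))   # brick1: brick2 learns the existing owner
bricks[1].up = False                  # a second brick fails
print(propose(bricks, 3, "brick2"))   # None: no majority, so no decision
```

The point of the middle case is that a later proposer cannot overturn 
an owner that a majority has already agreed on, even with a brick down.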

Does that make sense?  Apologies if the above is long-winded, but the 
point is really that the performance improvements come from pushing 
locks between bricks, and probably this is distinct from client level 
locking such as nfs/cifs/posix, etc

For advanced cluster filesystems such as GFS2, the general "optimistic 
locking" technique appears to show massive speed improvements (for many 
access patterns) and it's also likely to do so in Gluster.  Really my 
original email jumped two steps and suggested an improved form of 
distributed locking, which itself could be used as the actual 
implementation, but other forms of distributed locking between bricks 
would be highly desirable also.

Thanks for listening

Ed W