[Gluster-devel] Gluster driver for Archipelago - Development process

Fri Dec 6 14:16:55 UTC 2013

Hi all,

Since there are many people on gluster-devel who were not involved in
the initial mail exchange, I'll try to explain what we are talking about.

The rest of the *@grnet.gr CC:ed people and I are working on an
open-source project called Archipelago:
http://www.synnefo.org/docs/archipelago/latest/index.html.

Archipelago is a distributed storage layer that provides thin,
snapshottable volumes which can be used by virtual machines (VMs) to
store their data. It is independent of the underlying storage technology
and has drivers for different backends. A high-level overview of
Archipelago follows:

On 12/04/2013 08:29 AM, Vijay Bellur wrote:
> On 12/04/2013 06:45 AM, Anand Avati wrote:
>> On Tue, Dec 3, 2013 at 7:34 AM, Vijay Bellur <vbellur at redhat.com 
>> <mailto:vbellur at redhat.com>> wrote:
>> 
>> 
>>> An Archipelago volume is a COW volume, consisting of many 
>>> contiguous pieces (volume chunks), typically 4MB in size. It is 
>>> COW since it may share read-only chunks with other volumes (e.g. 
>>> if the volumes are created from the same OS image) and creates 
>>> new chunks to write to them. In order to refer to a chunk, we 
>>> assign a name to it (e.g. volume1_0002) which can be considered 
>>> as the object name (Rados) or file name (Gluster, NFS).
>> 
>>> The above logic is handled by separate Archipelago entities 
>>> (mappers, volume composers). This means that the storage driver’s
>>> only task is to read/write chunks from and to the storage. Also,
>>> given that there is one such driver per host - where 50 VMs can
>>> be running - means that it must handle a lot of chunks.
>>> 

So, what we are trying to achieve here is to create a high-performance
storage driver for Gluster. We have actually created a first version of
the driver[1] using the libgfapi library that is also used by the Qemu
driver and we are now discussing ways to improve its performance.

The driver functions that are most performance-sensitive are the
handle_read()/handle_write() functions. In a nutshell, the code flow of
these functions is the following.

<request context>
open() target file                                    (block *)
if it doesn't exist, create() the target file         (block)
async_read()/async_write() from/to target file        (non-block)
return

<callback context>
if interrupted, re-issue async_read()/async_write()   (non-block)
if OK, complete the I/O request and close() the file  (non-block)
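
As a rough illustration of the flow above, here is a minimal
single-threaded sketch in C. It uses plain POSIX calls as stand-ins for
the Gluster ones, and a hypothetical fake_async_pwrite() that only queues
the operation so that a pretend event loop can run the callback later.
None of these names come from the driver or from libgfapi; they are
assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request descriptor; not the driver's real structure. */
struct io_req {
	int fd;
	const char *buf;
	size_t len;
	off_t off;
	int interrupts_left;	/* test knob: fake this many interruptions */
	int attempts;		/* how many times the I/O was issued */
	ssize_t result;		/* return value of the final I/O attempt */
	int done;		/* set once the request is completed */
};

typedef void (*io_cb)(struct io_req *req, ssize_t ret, int err);

/* One-slot "in-flight" queue, enough for a sketch. */
static struct io_req *pending_req;
static io_cb pending_cb;

/* Stand-in for an async write: only queues the operation (non-block). */
static void fake_async_pwrite(struct io_req *req, io_cb cb)
{
	assert(pending_req == NULL);
	req->attempts++;
	pending_req = req;
	pending_cb = cb;
}

/* <callback context> */
static void write_done(struct io_req *req, ssize_t ret, int err)
{
	if (ret < 0 && err == EINTR) {
		fake_async_pwrite(req, write_done);	/* re-issue */
		return;
	}
	req->result = ret;
	close(req->fd);
	req->done = 1;		/* complete the I/O request */
}

/* <request context> */
static int handle_write(struct io_req *req, const char *path)
{
	req->fd = open(path, O_WRONLY);			/* may block */
	if (req->fd < 0 && errno == ENOENT)
		req->fd = open(path, O_WRONLY | O_CREAT, 0600);	/* blocks */
	if (req->fd < 0)
		return -1;
	fake_async_pwrite(req, write_done);		/* non-block */
	return 0;
}

/* Pretend event loop: performs the queued I/O and fires the callbacks. */
static void run_completions(void)
{
	while (pending_req) {
		struct io_req *req = pending_req;
		io_cb cb = pending_cb;

		pending_req = NULL;
		if (req->interrupts_left > 0) {
			req->interrupts_left--;
			cb(req, -1, EINTR);
		} else {
			ssize_t ret = pwrite(req->fd, req->buf, req->len, req->off);
			cb(req, ret, ret < 0 ? errno : 0);
		}
	}
}
```

In this sketch only the two open() calls in handle_write() can stall the
caller; everything after them is driven by callbacks.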

To understand why asynchronism is important, consider the following:

>>> 
>>> Now, back to our storage driver and the need for full 
>>> asynchronism. When it receives a read/write request for a chunk, 
>>> it will generally need to open the file, create it if it doesn’t
>>>  exist, perform the I/O and finally close the file. Having 
>>> non-blocking read/write but blocking open/create/close 
>>> essentially makes this request a blocking request. This means 
>>> that if the driver supports e.g. 64 in-flight requests, it needs 
>>> to have 64 threads to be able to manage all of them.
>> 

Hopefully, the above should give enough context to understand the rest
of this discussion. From this point on, I will answer inline to Vijay
and Anand.

>> 
>>> open/create/close are not completely synchronous in gluster with 
>>> open-behind and write-behind translators loaded in the client side 
>>> graph. open-behind and write-behind translators by default are part
>>> of the client side graph in GlusterFS. With open-behind translator,
>>> open() system call is short circuited by GlusterFS and the actual
>>> open happens in the background. For create, an open with O_CREAT |
>>> O_EXCL flags would be handled by open-behind. Similarly an actual
>>> close is done in the background by the write-behind translator. As
>>> such, Archipelago should not experience significant latency with
>>> open & close operations.
>>> 
>> 
>> A create (O_CREAT with or without O_EXCL) is currently not handled 
>> by open-behind and will always be synchronous with a network round 
>> trip.
> 
> My bad that I overlooked this behavior for create. Alex - would it 
> work if we were to invoke a callback with user provided context for 
> asynchronous creates from libgfapi?
> 

Please see below.

> 
>> An open() on an existing file is cut-short by open-behind. But even
>> that might not be sufficient because the path resolver (lookup()
>> etc.) always works synchronously with network round-trips for path
>> based operations. We will need to design new APIs for true async
>> path based operations (with path resolver also executed 
>> asynchronously).
> 
> True asynchronous behavior would be good to have. If tests with 
> Archipelago do not show significant latency for open operations, we 
> can possibly defer this to phase 2 of our integration and go ahead 
> with the existing open-behind implementation for now.
> 

First, let's identify the calls in the read()/write() example above that
may impact performance. I think that there are only two:

1) The open() call that will:
   a) block if the file is new or,
   b) "semi-block" until the path resolver returns.
2) The create() call, that occurs only after a failed open(), and will
block.

From what I gather, the close() call is always non-blocking and network
operations are handled in the background, so it should be fast. Please
correct me if I have missed something.

The way I see it, all the above can become a non-issue if the following
two things happen:

1) Open() becomes async.
2) Open() supports O_CREAT (not sure if doable, you can comment on that)

This seems to me like a one-stone-two-birds kind of thing, meaning that
improving the open() solves the create issue. Also, an extra bonus is
that only one API call and one round-trip to the storage will be needed.
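
To make the "one call instead of two" point concrete, here is how the
two variants compare on a local POSIX file system. This is only an
analogy built on the POSIX open(2) call; whether libgfapi can offer the
same thing in a single asynchronous call is exactly the open question.
The function names and the call counter are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Two-step variant: open(), then create on ENOENT. Counts calls made. */
static int open_then_create(const char *path, int *calls)
{
	int fd;

	assert(calls != NULL);
	*calls = 1;
	fd = open(path, O_RDWR);
	if (fd >= 0 || errno != ENOENT)
		return fd;

	/* The file does not exist yet: a second call is needed. */
	*calls = 2;
	return open(path, O_RDWR | O_CREAT, 0600);
}

/* One-step variant: O_CREAT opens the file, creating it if missing. */
static int open_or_create(const char *path, int *calls)
{
	assert(calls != NULL);
	*calls = 1;
	return open(path, O_RDWR | O_CREAT, 0600);
}
```

For a new chunk, open_then_create() needs two calls while
open_or_create() always needs one. For an existing chunk both need one.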

Granted, this may be a lot to do in one take so you can just as well
split it as you see fit. For example, implement the async open() first
and deal with the O_CREAT in the next phase (more about phases at the
end of this mail).

>> 
>> Let’s assume that open/create/close are indeed non-blocking or 
>> virtually nonexistent [1]. Most importantly, this would greatly 
>> reduce the read/write latency, especially for 4k requests. Another
>>  benefit is the ability to use a much smaller number of threads. 
>> However, besides read/write, there are also other operations that 
>> the driver must support such as stat()ing or deleting a file. If 
>> these operations are blocking, this means that a spurious delete 
>> and stat can stall our driver. Once more, it needs to have a lot of
>> threads to be operational.
>> 
>> 
>> Currently stat() and unlink() are synchronous too.
>> 
>> Note that internally all the operations in gluster are asynchronous
>> in their true nature. gfapi provides "convenience wrappers" for
>> these calls in a synchronous way. It is trivial to expose the
>> asynchronous calls through gfapi, but we haven't done so for the
>> path based operations because there hasn't been a need thus far.
>> And without even a single consumer, we did not want to reason about
>> the semantics of async path calls.
>> 
>> Do you have an example driver/header which shows the ideal behavior
>> for the async path based calls? Will you provide a context pointer
>> and expect to receive it in the callback? Or do you expect the API
>> to return a stub for the async call dispatch and poll on it?
>>

I think that the callback-based API that Gluster provides is the most
natural choice. Note that the stat()/unlink() are not in the performance
critical path, so they can be done last (see phases section).

As for the requested examples, you can see the example that I have
posted above or take a look at the code[1] (see the
handle_copy/handle_read/handle_write functions). All of these adhere
pretty much to the following routine:

callback_function()
{
	handle_function();
}

handle_function()
{
	if request == phase1:
		request = phase2;
		async_op(..., callback_function);
	else if request == phase2:
		/* check if async_op has completed successfully */
		request = phase3;
		async_op(..., callback_function);
		.
		.
		.
	else if request == phaseN:
		/* check if async_op has completed successfully */
		complete(request)
}
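
A compilable rendering of the routine above might look like the sketch
below, with three phases. The async_op() here is a hypothetical stand-in
that completes inline by invoking the callback with the context pointer;
a real asynchronous call would return first and fire the callback later
from another context. The struct layout and the fail_at knob are
assumptions added for illustration and do not come from the driver.

```c
#include <assert.h>
#include <stddef.h>

enum phase { PHASE1, PHASE2, PHASE3, PHASE_DONE, PHASE_FAILED };

struct request {
	enum phase phase;
	int last_ret;	/* result of the most recent async_op */
	int fail_at;	/* test knob: op number that should fail, 0 = none */
	int ops_issued;
};

typedef void (*op_cb)(void *ctx, int ret);

static void handle_function(struct request *req);

/* The callback gets the context pointer back and resumes the request. */
static void callback_function(void *ctx, int ret)
{
	struct request *req = ctx;

	req->last_ret = ret;
	handle_function(req);
}

/* Hypothetical async primitive: completes inline for simplicity. */
static void async_op(struct request *req, op_cb cb)
{
	assert(cb != NULL);
	req->ops_issued++;
	cb(req, req->ops_issued == req->fail_at ? -1 : 0);
}

static void handle_function(struct request *req)
{
	/* Every phase but the first starts by checking the previous op. */
	if (req->phase != PHASE1 && req->last_ret < 0) {
		req->phase = PHASE_FAILED;
		return;
	}

	switch (req->phase) {
	case PHASE1:
		req->phase = PHASE2;
		async_op(req, callback_function);
		break;
	case PHASE2:
		req->phase = PHASE3;
		async_op(req, callback_function);
		break;
	case PHASE3:
		req->phase = PHASE_DONE;	/* complete(request) */
		break;
	default:
		break;
	}
}
```

The request itself serves as the context pointer, so the callback needs
no other state to know where to resume.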

>>>>> 
>>>>> 2. There is no way to create notifications on a file (as Rados can 
>>>>> with its objects).
>>>>> 
>>>> 
>>>> How are these notifications consumed?
>>>> 
>>> 
>>> They are consumed by the lock/unlock operations that are also 
>>> handled by our driver. For instance, the Rados driver can wait 
>>> asynchronously for someone to unlock an object by registering a 
>>> watch to the object and a callback function. Conversely, the unlock
>>> operation makes sure to send a notification to all watchers of the
>>> object. Thus, the lock/unlock operation can happen asynchronously
>>> [2].
>>> 
>>> I have read that Gluster supports Posix locks, but this is not the 
>>> locking scheme we have in mind. We need a persistent type of lock 
>>> that would stay on a file even if the process closed the file 
>>> descriptor or worse, crashed.
>>> 
>> 
>> How would you recover if a process that held the lock goes away 
>> forever? We can provide an option to make this kind of behavior 
>> possible with posix-locks translator.
>> 

Please see below.

>>
>> Our current solution is to  create a “lock file” e.g. 
>> “volume1_0002_lock” with the owner name written in it. Thus, the 
>> lock/unlock operations generally happen as follows:
>> 
>> a) Lock: Try to exclusively create a lock file. If successful,
>> write the owner id to it. If not, sleep for 1 second and retry.
>> b) Unlock: Read a lock file and its owner. If we are the owner,
>> delete it. Else, fail.
>> 
>> As you can see, this is not an elegant way and is subject to race 
>> conditions. If Gluster can provide a better solution, we would be 
>> more than happy to know about it.
>> 
>> 
>> Currently our locking API exposed through gfapi is POSIX-like 
>> (synchronous, non-persistent). We have other internal locking 
>> mechanisms (entry-lk for locking an abstract name/string in a 
>> directory and inode-lk - nested range locks in a file) which are 
>> currently not exposed through gfapi. But these are not persistent 
>> either. Providing async API version of these calls is not hard (if
>>  we have a good understanding about things like whether the caller
>>  provides a context pointer or expects a call specific stub from 
>> the API etc). However I'm not sure (yet) how to provide persistent
>>  locks (and what the right behavior should be in the event of crash
>>  of the caller). You may be able to simulate (somewhat) persistent
>>  locks in the driver using a combination of sync/async locking APIs
>>  + xattrs.
> 
> The requirement from Archipelago seems to necessitate avoiding clean
>  up of locks during release/flush/disconnect. There are some 
> complexities that we will run into if we want to provide this 
> behavior. Understanding the lock recovery and cleanup semantics 
> better will help us in determining the right way out here.
> 

Like stat()/unlink(), lock/unlock operations are infrequent, so
implementing persistent locks can be deferred to the second phase.
Also, to answer your questions, I will provide the current lock
requirements of Archipelago.

=== Archipelago lock requirements =========

Locks are required in Archipelago to regulate which storage driver (a
separate user-space process) can access the volume's data at any time.
If a second driver attempts to write to this volume, e.g. due to a VM
migration, it should be prevented from doing so until the first storage
driver has unlocked the volume.

If a storage driver that holds a lock crashes, then the volume that was
locked *must* remain locked. This way, we can make sure that no-one will
be able to write to the volume, even if the volume migrates to another host.

Once the administrators have resolved the issue and the storage driver
is back online, they can instruct it to "break" the lock. For the RADOS
driver, there is already a rados_break_lock() call that releases the
lock. For the Gluster/NFS driver, we simply unlink() the lock_file.

Furthermore, Archipelago expects that it can place more than one *named*
lock on the same volume. These different locks can be used to
synchronize different storage drivers and should be either in shared or
exclusive mode.

Moreover, each lock should have an owner id, that can be generated
implicitly (e.g. a unique library client id) or explicitly i.e. an id
provided by the storage driver.

Finally, Archipelago expects that the lock operation can have blocking
semantics or try-once semantics. The unlock operation, on the other hand,
has try-once or break semantics.

===============================================

Almost all of the above (besides shared/exclusive locks and unique
library client ids) is "implemented" in the first version of this
driver[1]; see the handle_acquire/handle_release functions. This goes
to show that Gluster can be tweaked to provide support for persistent
locks.
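
For readers who do not want to dig through the driver, the lock-file
approach can be sketched with plain POSIX calls as below. This is not
the code of handle_acquire/handle_release: the function names, the
fixed-size owner buffer and the restriction to try-once semantics are
assumptions made for illustration. As with the approach itself, the
unlock path is racy between the ownership check and the unlink().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define OWNER_MAX 64

/* Try-once lock: 0 on success, -1 if the lock is held or on error. */
static int lock_try(const char *lock_path, const char *owner)
{
	size_t len = strlen(owner);
	int fd;

	assert(len > 0 && len < OWNER_MAX);
	/* O_EXCL makes creation fail with EEXIST if the lock file exists. */
	fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, owner, len) != (ssize_t)len) {
		close(fd);
		unlink(lock_path);
		return -1;
	}
	/* The file outlives this descriptor, so the lock is persistent. */
	return close(fd);
}

/* Try-once unlock: only the recorded owner may remove the lock. */
static int lock_release(const char *lock_path, const char *owner)
{
	char buf[OWNER_MAX] = { 0 };
	int fd = open(lock_path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n < 0 || strcmp(buf, owner) != 0)
		return -1;
	/* Not atomic with the check above: the approach is racy. */
	return unlink(lock_path);
}

/* Break semantics: remove the lock regardless of its owner. */
static int lock_break(const char *lock_path)
{
	return unlink(lock_path);
}
```

A blocking lock would simply call lock_try() in a loop with a sleep in
between, which is what makes this scheme inelegant.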

-----------------------------------------------------------------------

Summing up:

My opinion is that persistent locking is of secondary importance, as it
can be handled at the application level. Async open with O_CREAT, on the
other hand, cannot.

Thus, I propose for the first phase to make the open() call non-blocking
and possibly add the O_CREAT flag. For our part, we can benchmark the
performance of the current driver and measure it against any
improvements made along the way.

For the second phase, we can work out a persistent locking scheme that
would benefit both parties.

Finally, the third phase can revolve around making the rest of the
blocking calls non-blocking.

So, what do you think?

Regards,
Alex

[1] To download the source code, you can do the following:

	git clone https://code.grnet.gr/git/archipelago
	cd archipelago
	git checkout feature-glusterd
	vi xseg/peers/user/glusterd.c

or view the source code directly from this URL:

https://code.grnet.gr/projects/archipelago/repository/revisions/feature-glusterd/entry/xseg/peers/user/glusterd.c

-- 
Alex | apyrgio at grnet.gr