[Gluster-devel] Gluster driver for Archipelago - Development process

Tue Dec 3 15:34:41 UTC 2013

Adding gluster-devel as there is a good amount of detail on the ongoing 
integration with Archipelago.

On 11/26/2013 03:12 PM, Alex Pyrgiotis wrote:
> Hi Vijay,
>
> I will answer your questions inline:
>
> On 11/21/2013 09:25 PM, Vijay Bellur wrote:
>> On 11/19/2013 10:51 PM, Alex Pyrgiotis wrote:
>>> [for the installation...] we consulted the Debian README [1] of
>>> GlusterFS and downloaded the packages for Wheezy. Although the
>>> packages were installed properly, to the best of our knowledge we
>>> found no trace of the libgfapi library. Thus, we cloned the git
>>> repo, checked out the 3.4 branch and compiled
>>> libglusterfs/libgfapi from there.
>>>
>>> <...snip...>
>>
>> I think libgfapi is part of glusterfs-common in Debian. But I will
>> work with our package maintainers to see if having a separate
>> package for libgfapi is possible.
>>
>
> Hm, that's interesting.
>
> When I was searching for the libgfapi library, I was simply looking for
> the header files. I looked into it further and ran:
>
> dpkg -c glusterfs-common_3.4.1-2_amd64.deb
>
> Although there is no trace of the header files, I see that libgfapi.so
> is there. I searched about this issue and found this bug report:
>
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717558
>
> My guess is that the fix of the Debian maintainers has not been
> backported to the packages you create for Wheezy. So, you probably need
> to add that to the packages you provide.
>
> In any case, I created a single Gluster server/client and copied the
> header files by hand to check if the tests that I have posted pass. It
> seems that the first two tests complete successfully while the Python
> test still fails. I'll re-run them once I tear down and rebuild my
> cluster with the correct packages.
>
>>
>>> <...snip...>
>>>
>>> 1. There are no async operations for open/create/close/stat/unlink,
>>> which are necessary for various operations of Archipelago.
>>
>> Is there more description on how various operations of Archipelago
>> rely on async operations for open/create etc.? I must admit that I
>> haven't gone through your code but will definitely do so to get a
>> better understanding.
>>
>
> Sure, I'll explain our rationale, but first let me provide some insight
> into the fundamental logic of Archipelago, so that the context in which
> we operate is clear:
>
> An Archipelago volume is a COW volume, consisting of many contiguous
> pieces (volume chunks), typically 4MB in size. It is COW since it may
> share read-only chunks with other volumes (e.g. if the volumes are
> created from the same OS image) and creates new chunks to write to
> them. In order to refer to a chunk, we assign a name to it (e.g.
> volume1_0002) which can be considered as the object name (Rados) or file
> name (Gluster, NFS).
>
> The above logic is handled by separate Archipelago entities (mappers,
> volume composers). This means that the storage driver’s only task is to
> read/write chunks from and to the storage. Also, since there is one
> such driver per host, where 50 VMs can be running, it must handle a
> lot of chunks.
>
> Now, back to our storage driver and the need for full asynchronism. When
> it receives a read/write request for a chunk, it will generally need to
> open the file, create it if it doesn’t exist, perform the I/O and
> finally close the file. Having non-blocking read/write but blocking
> open/create/close essentially makes this request a blocking request.
> This means that if the driver supports e.g. 64 in-flight requests, it
> needs to have 64 threads to be able to manage all of them.
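To make the thread-count argument above concrete, here is a minimal
sketch of a thread-per-request driver. It is not Archipelago or libgfapi
code: it uses plain POSIX file calls through Python's os module, and
CHUNK_DIR, MAX_IN_FLIGHT, write_chunk and read_chunk are hypothetical
names chosen for illustration. Every step of a request blocks, so each
in-flight request occupies one worker thread.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Assumption: a local temp directory stands in for the storage backend.
CHUNK_DIR = tempfile.mkdtemp()
# Assumption: the driver supports 64 in-flight requests, as in the example.
MAX_IN_FLIGHT = 64


def write_chunk(name, offset, data):
    """Serve one write request: open (create if missing), write, close."""
    path = os.path.join(CHUNK_DIR, name)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)  # blocking open/create
    try:
        return os.pwrite(fd, data, offset)             # blocking I/O
    finally:
        os.close(fd)                                   # blocking close


def read_chunk(name, offset, length):
    """Serve one read request: open, read, close."""
    path = os.path.join(CHUNK_DIR, name)
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


# One worker per in-flight request: because every step above blocks, the
# driver can only overlap as many requests as it has threads.
pool = ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT)
```

A request is then submitted with, for example,
pool.submit(write_chunk, "volume1_0002", 0, data). With fully
asynchronous open/create/close, the same 64 requests could be driven
from far fewer threads.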

open/create/close are not completely synchronous in Gluster when the
open-behind and write-behind translators are loaded in the client-side
graph, which is the default in GlusterFS. With the open-behind
translator, the open() system call is short-circuited by GlusterFS and
the actual open happens in the background. For create, an open with
O_CREAT | O_EXCL flags would be handled by open-behind. Similarly, the
actual close is done in the background by the write-behind translator.
As such, Archipelago should not experience significant latency with
open and close operations.

>
> Let’s assume that open/create/close are indeed non-blocking or virtually
> nonexistent [1]. Most importantly, this would greatly reduce the
> read/write latency, especially for 4k requests. Another benefit is the
> ability to use a much smaller number of threads. However, besides
> read/write, there are also other operations that the driver must support
> such as stat()ing or deleting a file. If these operations are
> blocking, this means that a spurious delete and stat can stall our
> driver. Once more, it needs to have a lot of threads to be operational.

If the same file is being stat()'d, the md-cache translator in the
client stack can prevent stat from making a round trip to the server
and thereby provide lower latency for stat. Have you come across
situations in your testing where the latency has been higher due to the
synchronous nature of stat requests? What is the frequency of unlink
operations?

>
>>
>>> 2. There is no way to create notifications on a file (as Rados can
>>> with its objects).
>>
>> How are these notifications consumed?
>>
>
> They are consumed by the lock/unlock operations that are also handled by
> our driver. For instance, the Rados driver can wait asynchronously for
> someone to unlock an object by registering a watch to the object and a
> callback function. Conversely, the unlock operation makes sure to send a
> notification to all watchers of the object. Thus, the lock/unlock
> operation can happen asynchronously [2].
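The watch/notify flow described above can be sketched in a few lines.
This is a single-process, in-memory illustration only, not the Rados
API: WatchRegistry and its methods are hypothetical names, and a real
driver would go through the storage backend instead of a Python dict.

```python
from collections import defaultdict


class WatchRegistry:
    """Hypothetical in-memory model of lock/unlock with watch/notify."""

    def __init__(self):
        self._owners = {}                   # object name -> current owner
        self._watchers = defaultdict(list)  # object name -> [(owner, callback)]

    def lock(self, obj, owner, on_granted):
        """Take the lock if it is free; otherwise register a watch.

        on_granted(obj, owner) is called once the lock is acquired,
        either immediately or after a later unlock notification.
        """
        if obj not in self._owners:
            self._owners[obj] = owner
            on_granted(obj, owner)
            return True
        self._watchers[obj].append((owner, on_granted))
        return False

    def unlock(self, obj, owner):
        """Release the lock if we own it and notify every watcher."""
        if self._owners.get(obj) != owner:
            return False
        del self._owners[obj]
        # Each notified watcher retries the lock: the first one wins,
        # the others find it taken and register a watch again.
        for waiter, callback in self._watchers.pop(obj, []):
            self.lock(obj, waiter, callback)
        return True
```

The point of the design is that a waiter never sleeps or polls: it
registers a callback and is woken by the unlock notification.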
>
> I have read that Gluster supports Posix locks, but this is not the
> locking scheme we have in mind. We need a persistent type of lock that
> would stay on a file even if the process closed the file descriptor or
> worse, crashed.

How would you recover if a process that held the lock goes away forever?
We can provide an option to make this kind of behavior possible with
the posix-locks translator.

> Our current solution is to create a “lock file” e.g.
> “volume1_0002_lock” with the owner name written in it. Thus, the
> lock/unlock operations generally happen as follows:
>
> a) Lock: Try to exclusively create a lock file. If successful, write the
> owner id to it. If not, sleep for 1 second and retry.
> b) Unlock: Read a lock file and its owner. If we are the owner, delete
> it. Else, fail.
>
> As you can see, this is not an elegant approach and is subject to race
> conditions. If Gluster can provide a better solution, we would be more
> than happy to know about it.
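For reference, the lock-file scheme in (a) and (b) above can be sketched
as follows. This is an illustration with plain POSIX calls through
Python's os module, not the actual Archipelago driver code; the function
names and the max_tries parameter are assumptions added so that the
sketch can give up instead of retrying forever.

```python
import os
import time


def lock(lock_path, owner, retry_delay=1.0, max_tries=None):
    """(a) Exclusively create the lock file and write the owner id to it.

    Retries every retry_delay seconds while the file already exists.
    max_tries is a hypothetical addition; None means retry forever.
    """
    tries = 0
    while True:
        try:
            fd = os.open(lock_path,
                         os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            tries += 1
            if max_tries is not None and tries >= max_tries:
                return False
            time.sleep(retry_delay)
            continue
        try:
            # Race: until this write lands, the lock file exists but
            # names no owner.
            os.write(fd, owner.encode())
        finally:
            os.close(fd)
        return True


def unlock(lock_path, owner):
    """(b) Delete the lock file only if we are the recorded owner."""
    try:
        with open(lock_path) as f:
            current = f.read()
    except FileNotFoundError:
        return False
    if current != owner:
        return False
    # Race: the check above and this unlink are not atomic.
    os.unlink(lock_path)
    return True
```

The comments mark the two windows where the scheme can race, and a
crashed owner leaves the lock file behind with nothing to clean it up.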

An asynchronous API for locks (callback upon lock granted/released)
could possibly be implemented. Since you mention that the number of
locking calls is not very significant, can we take this up after the
first phase of integration is done?

Regards,
Vijay

>
> Regards,
> Alex
>
>
>
> [1] In our NFS driver, we sidestep this issue by caching the file
> descriptors, so the blocking open/create/close will happen less
> frequently, provided we have a large enough cache. This however is not a
> reliable solution.
>
> [2] To be fair, currently Rados does not have an asynchronous
> lock/unlock function, so we spawn a thread that handles this task. Since
> locking/unlocking operations are scarce though, the driver's performance
> is not affected by it.
>