[Gluster-devel] Feature help

Tue Nov 4 22:55:45 UTC 2014

responses inline ...

On Tue, Nov 4, 2014 at 8:00 AM, Ben England <bengland at redhat.com> wrote:
> inline...
>
>>
>
> I have a parallel-libgfapi benchmark that could be modified to fit the new API, and could test performance of it.
>
> https://github.com/bengland2/parallel-libgfapi
>

I'm not familiar with it, but I will look into it and experiment with
it to see how to use it.

>>
>
> libgfapi is a good place to prototype, and it's easy to change libgfapi by adding to the existing calls, but this won't help performance as much as you might want unless the Gluster protocol can somehow change to allow several separate FOPs to be combined, such as LOOKUP, OPEN, READ and RELEASE, or LOOKUP, CREATE, WRITE and RELEASE. That's the hard part IMHO. I suggest using wireshark to watch Gluster small-file creates, and then trying to understand what each FOP is doing and why it is there.
>

I took some wireshark captures with a few bricks and looked at the
calls. I was trying to work out an alternate interface that stays away
from the standard calls. My feeling was that, for an atomic read/write,
the overhead of issuing multiple calls could be avoided by combining
them into one.

> suggestions for protocol enhancement:
>
> Can we allow CREATE to piggyback write data if it's under 128 KB or whatever the RPC size limit is, and optionally do a RELEASE after the WRITE? Or just create a new FOP that does that? Can we also specify xattrs that the application might want to set at create time? For example, SMB security-related xattrs or Swift metadata.
>
Does it make sense to take existing FOPs and batch them for
submission, then? Is that even possible?

I was thinking of a new FOP that could pack multiple atomic
operations (primarily because the existing operations seem to take a
certain path in the code, and adding complexity to that path may
disturb it for no good reason).
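
To make the idea a bit more concrete, here is a rough sketch of what
such a packed request might carry. This is illustration only: the
CompoundCreate name, its fields, and the one-SETXATTR-per-xattr
expansion are my assumptions and do not exist in Gluster today. The
128 KB figure is just the number mentioned above, not a confirmed RPC
limit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed payload ceiling; the thread only says "128 KB or whatever
# the RPC size limit is", so treat this as a placeholder.
ASSUMED_RPC_LIMIT = 128 * 1024


@dataclass
class CompoundCreate:
    """Hypothetical request that packs create + write (+ release)."""
    path: str
    data: bytes
    xattrs: Dict[str, bytes] = field(default_factory=dict)
    release: bool = True  # optionally RELEASE after the WRITE

    def as_separate_fops(self) -> List[Tuple[str, str]]:
        """List the individual FOPs this one request stands in for."""
        fops = [("LOOKUP", self.path), ("CREATE", self.path)]
        # Assumption: each xattr would otherwise need its own call.
        fops += [("SETXATTR", self.path) for _ in self.xattrs]
        fops.append(("WRITE", self.path))
        if self.release:
            fops.append(("RELEASE", self.path))
        return fops

    def fits_in_one_rpc(self, limit: int = ASSUMED_RPC_LIMIT) -> bool:
        """True if the data is small enough to piggyback on the create."""
        return len(self.data) <= limit
```

A request that does not fit would simply fall back to the existing
code path, which keeps the new FOP from disturbing it.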

> Can we do something like we did for sequential writes with eager-lock, and allow Gluster client to hang on to directory lock for a little while so that we don't have to continually reacquire the lock if we are going to keep creating files in it?
>
> Second, if we already have a write lock on the directory, we shouldn't have to do LOOKUP then CREATE, just do CREATE directly.
>

If the request is a single atomic operation, does this still hold?

> Finally, Swift and other apps use the hack of a rename() call after close() so that they can create a file atomically. If we had an API for creating files atomically, then these apps would not be forced into using the expensive rename operation.
>
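
For reference, the rename-after-close pattern looks roughly like this
on a local POSIX file system. This is a minimal sketch and not Swift's
actual code; the atomic_create name and the ".tmp-" prefix are made up
for the example.

```python
import os
import tempfile


def atomic_create(path: str, data: bytes) -> None:
    """Write to a temporary file, then rename it into place."""
    directory = os.path.dirname(path) or "."
    # The temp file must live in the same directory (same file system)
    # as the target so that the rename is a pure metadata operation.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace() maps to rename() on POSIX: readers see either
        # the old file or the complete new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

On Gluster, that final rename is the expensive step an atomic-create
API would let applications skip.
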
> Can we do these things in an incremental way so that we can steadily improve performance over time without massive disruption to code base?
>
> Perhaps Glusterfs FUSE mount could learn to do something like that as well with a special mount option that would allow actual create at server to be deferred until any one of these 3 conditions occurred:
>
> - 100 msec had passed, or
> - the file was closed, or
> - at least N KB of data was written (i.e. an RPC's worth)
>
>
> This is a bit like Nagle's algorithm in TCP, which allows TCP to aggregate more data into segments before it actually transmits them.  It technically violates POSIX and creates some semantic issues (how do you tell user that file already exists, for example?), but frankly fs interface in POSIX is an anachronism, we need to bend it a little to get what we need, NFS already does.  This might not be appropriate for all apps but there might be quite a few cases like initial data ingest where this would be a very reasonable thing to do.
>
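
The three conditions above could be expressed as a small policy object
like the one below. This is only a sketch of the decision logic, not
of how the FUSE client would implement it; the DeferredCreate name and
the 128 KB default for "N KB" are assumptions on my part.

```python
from dataclasses import dataclass


@dataclass
class DeferredCreate:
    """Tracks one file whose create has not yet been sent to the server."""
    created_at: float             # time the app called create(), in seconds
    max_delay: float = 0.100      # 100 msec
    max_bytes: int = 128 * 1024   # "N KB"; value assumed for illustration
    buffered: int = 0
    closed: bool = False

    def write(self, nbytes: int) -> None:
        """Record data buffered on the client side."""
        self.buffered += nbytes

    def close(self) -> None:
        """Record that the application closed the file."""
        self.closed = True

    def should_send(self, now: float) -> bool:
        """True once any one of the three conditions is met."""
        return (
            now - self.created_at >= self.max_delay
            or self.closed
            or self.buffered >= self.max_bytes
        )
```

The caller would check should_send() after each write, on close, and
from a timer, and issue the real create when it returns True.
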
>
>> The following is what I was thinking - please feel free to correct me
>> or guide me if someone has already done some ground work on this.
>>
>> For read, multiple objects can be provided and they should be
>> separated for read from appropriate brick based on the DHT flag - this
>> will help avoid multiple lookups from all servers. In the absence of
>> DHT they would be sent to all but only the ones that contain the
>> object respond (it's more like a multiple file lookup request).
>>
>
> I think it is very ambitious to batch creates for multiple files, and this greatly complicates the API.   Let's just get to a point where we can create a Gluster file and write the data for it in the same libgfapi call and have that work efficiently in the Gluster RPC interface -- this would be a huge win.
>

Agreed.

>> For write, same as the case of read, complete object writes (no
>> partial updates, file offsets etc.)
>>
>> For delete, most of the lookup and batching logic remains the same.
>>
>
> Delete is not the highest priority thing here. Creates are the worst performers, so we probably should focus on creates. Someday it would be nice to be able to express the thought to the file system "delete this directory tree" or "delete all files within this directory", since Gluster could then make that a parallel operation, hence scalable.
>
-Siva