[Gluster-devel] Feature help

Ben England bengland at redhat.com
Tue Nov 4 13:00:55 UTC 2014


inline...

----- Original Message -----
> From: "Rudra Siva" <rudrasiva11 at gmail.com>
> To: gluster-devel at gluster.org
> Sent: Saturday, November 1, 2014 10:20:41 AM
> Subject: [Gluster-devel] Feature help
> 
> Hi,
> 
> I'm very interested in helping with this feature by way of development
> help, testing and or benchmarking.
> 

I have a parallel-libgfapi benchmark that could be modified to fit the new API and could be used to test its performance.

https://github.com/bengland2/parallel-libgfapi

> Features/Feature Smallfile Perf
> 
> One of the things I was looking into was possibility of adding a few
> API calls to libgfapi to help allow reading and writing multiple small
> files as objects - just as librados does for ceph - cutting out FUSE
> and other semantics that tend to be overheads for really small files.
> I don't know what else I will have to add for libgfapi to support
> this.
> 

libgfapi is a good place to prototype: it's easy to extend by adding to the existing calls. But this won't help performance as much as you might want unless the Gluster protocol can somehow change to allow combining several separate FOPs, such as LOOKUP, OPEN, READ and RELEASE, or LOOKUP, CREATE, WRITE and RELEASE. That's the hard part IMHO. I suggest using wireshark to watch Gluster small-file creates, and then try to understand what each FOP is doing and why it is there.
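
For reference, here is roughly what a small-file create looks like today through libgfapi; the volume and server names are just placeholders. Each client call below fans out into one or more of the FOPs above (glfs_creat -> LOOKUP + CREATE, glfs_write -> WRITE, glfs_close -> roughly FLUSH + RELEASE), which is exactly the per-file round-trip overhead we'd like to collapse:

/* small-file create through libgfapi today: each step becomes one or
 * more separate FOPs on the Gluster RPC connection.
 * Build with: gcc -o smallfile smallfile.c -lgfapi
 */
#include <glusterfs/api/glfs.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
        glfs_t *fs = glfs_new("testvol");   /* placeholder volume name */
        glfs_set_volfile_server(fs, "tcp", "gluster-server", 24007);
        if (glfs_init(fs) != 0) {
                perror("glfs_init");
                return 1;
        }

        const char *data = "tiny payload";
        glfs_fd_t *fd = glfs_creat(fs, "/dir/file-0001", O_WRONLY, 0644); /* LOOKUP + CREATE */
        if (fd) {
                glfs_write(fd, data, strlen(data), 0); /* WRITE */
                glfs_close(fd);                        /* FLUSH + RELEASE */
        }
        glfs_fini(fs);
        return 0;
}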

suggestions for protocol enhancement:

Can we allow CREATE to piggyback write data if it's under 128 KB or whatever the RPC size limit is, and optionally do a RELEASE after the WRITE? Or just create a new FOP that does that? Can we also specify xattrs that the application might want to set at create time? Examples: SMB security-related xattrs, Swift metadata.
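
To make that concrete, a combined call might look something like the sketch below. This is purely hypothetical (glfs_put_object and glfs_xattr_kv do not exist in libgfapi); the point is just that the path, the data and any create-time xattrs travel in a single compound RPC:

/* HYPOTHETICAL API sketch -- not part of libgfapi today.
 * One call carries the path, the full (small) file contents, and any
 * xattrs to set at create time, so the client can emit a single
 * compound RPC instead of LOOKUP + CREATE + WRITE + SETXATTR + RELEASE.
 * Types glfs_t and mode_t come from <glusterfs/api/glfs.h>.
 */
struct glfs_xattr_kv {
        const char *name;   /* e.g. "security.NTACL" or "user.swift.metadata" */
        const void *value;
        size_t      len;
};

int glfs_put_object(glfs_t *fs,
                    const char *path,
                    mode_t mode,
                    const void *buf, size_t count,       /* must fit in one RPC */
                    const struct glfs_xattr_kv *xattrs,  /* optional, may be NULL */
                    int nxattrs,
                    int flags);                          /* e.g. auto-release after write */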

Can we do something like we did for sequential writes with eager-lock, and allow the Gluster client to hang on to the directory lock for a little while, so that we don't have to continually reacquire the lock if we are going to keep creating files in that directory?

Second, if we already hold a write lock on the directory, we shouldn't have to do a LOOKUP and then a CREATE; we can just do the CREATE directly.

Finally, Swift and other apps use the hack of calling rename() after close() so that they can create a file atomically. If we had an API for creating files atomically, these apps would not be forced into the expensive rename operation.
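
For context, the rename hack looks roughly like this with today's libgfapi calls (all of these functions exist; the paths are placeholders and fs is assumed to be an already-initialized glfs_t). An atomic-create API would let applications skip the temporary name and the extra rename round trip entirely:

/* The "write temp file, then rename over the final name" pattern that
 * Swift and others use to get an atomic create.
 */
#include <glusterfs/api/glfs.h>
#include <fcntl.h>
#include <string.h>

static int put_atomic(glfs_t *fs, const char *final_path,
                      const char *tmp_path, const void *buf, size_t len)
{
        glfs_fd_t *fd = glfs_creat(fs, tmp_path, O_WRONLY, 0644);
        if (!fd)
                return -1;
        if (glfs_write(fd, buf, len, 0) < 0) {
                glfs_close(fd);
                return -1;
        }
        glfs_close(fd);
        /* the expensive part: extra round trips and directory work
         * just to make the file visible atomically under its real name */
        return glfs_rename(fs, tmp_path, final_path);
}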

Can we do these things in an incremental way so that we can steadily improve performance over time without massive disruption to code base?

Perhaps the GlusterFS FUSE mount could learn to do something like that as well, with a special mount option that would defer the actual create at the server until any one of these three conditions occurred:

- 100 msec had passed, or 
- the file was closed, or
- at least N KB of data was written (i.e. an RPC's worth)

This is a bit like Nagle's algorithm in TCP, which lets TCP aggregate more data into segments before it actually transmits them. It technically violates POSIX and creates some semantic issues (how do you tell the user that the file already exists, for example?), but frankly the POSIX filesystem interface is an anachronism; we need to bend it a little to get what we need, and NFS already does. This might not be appropriate for all applications, but there are probably quite a few cases, like initial data ingest, where this would be a very reasonable thing to do.
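
A minimal sketch of that trigger logic, purely illustrative -- the struct and function names are invented and nothing like this exists in the FUSE bridge today:

/* HYPOTHETICAL sketch of the Nagle-like deferral decision.  The client
 * buffers the create locally and only sends the compound
 * CREATE+WRITE(+RELEASE) when one of the three conditions fires.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define DEFER_MAX_USEC   (100 * 1000)   /* 100 msec */
#define DEFER_MAX_BYTES  (128 * 1024)   /* roughly one RPC's worth */

struct deferred_create {
        uint64_t opened_at_usec;   /* when the app called create() */
        size_t   buffered_bytes;   /* data written so far, held client-side */
        bool     closed;           /* app has already called close() */
};

static bool should_flush_create(const struct deferred_create *dc,
                                uint64_t now_usec)
{
        return dc->closed ||
               dc->buffered_bytes >= DEFER_MAX_BYTES ||
               (now_usec - dc->opened_at_usec) >= DEFER_MAX_USEC;
}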


> The following is what I was thinking - please feel free to correct me
> or guide me if someone has already done some ground work on this.
> 
> For read, multiple objects can be provided and they should be
> separated for read from appropriate brick based on the DHT flag - this
> will help avoid multiple lookups from all servers. In the absence of
> DHT they would be sent to all but only the ones that contain the
> object respond (it's more like a multiple file lookup request).
> 

I think it is very ambitious to batch creates for multiple files, and this greatly complicates the API. Let's just get to a point where we can create a Gluster file and write its data in the same libgfapi call, and have that work efficiently in the Gluster RPC interface -- this would be a huge win.

> For write, same as the case of read, complete object writes (no
> partial updates, file offsets etc.)
> 
> For delete, most of the lookup and batching logic remains the same.
>

Delete is not the highest-priority thing here. Creates are the worst performers, so we should probably focus on creates. Someday it would be nice to be able to tell the file system "delete this directory tree" or "delete all files within this directory", since Gluster could then make that a parallel operation, and hence scalable.
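
For illustration, here is what "delete all files within this directory" costs a client today with existing libgfapi calls (fs is an already-initialized glfs_t and the directory path is a placeholder): one readdir stream plus one serialized UNLINK round trip per file. A server-side primitive could fan those unlinks out across bricks in parallel instead:

/* Client-side bulk delete today: each file is a separate UNLINK FOP. */
#include <glusterfs/api/glfs.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void unlink_all(glfs_t *fs, const char *dirpath)
{
        char path[4096];
        struct dirent de, *result;
        glfs_fd_t *dfd = glfs_opendir(fs, dirpath);

        if (!dfd)
                return;
        while (glfs_readdir_r(dfd, &de, &result) == 0 && result) {
                if (!strcmp(de.d_name, ".") || !strcmp(de.d_name, ".."))
                        continue;
                snprintf(path, sizeof(path), "%s/%s", dirpath, de.d_name);
                glfs_unlink(fs, path);   /* one round trip per file */
        }
        glfs_closedir(dfd);
}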

> I can help with testing, documentation or benchmarks if someone has
> already done some work.
> 
> -Siva
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 

