[Gluster-devel] Feature help

Shyam srangana at redhat.com
Mon Nov 10 19:11:47 UTC 2014


On 11/01/2014 10:20 AM, Rudra Siva wrote:
> Hi,
>
> I'm very interested in helping with this feature by way of development
> help, testing, and/or benchmarking.
>
> Features/Feature Smallfile Perf
>
> One of the things I was looking into was possibility of adding a few
> API calls to libgfapi to help allow reading and writing multiple small
> files as objects - just as librados does for ceph - cutting out FUSE
> and other semantics that tend to be overheads for really small files.
> I don't know what else I will have to add for libgfapi to support
> this.

The response below is based on reading this mail and the other mail 
that you sent titled "libgfapi object api", which I believe expands 
on the actual APIs that you are thinking of. (The following commentary 
is to glean more information, as this is something that can help small 
file performance, and it could be the result of my own misunderstanding :) )

- Who are the consumers of such an API?
The way I see it, FUSE does not have a direct way to use this 
enhancement, unless we think of ways, like the ones Ben proposed, to 
defer and detect small file creates.

Neither do the NFS or SMB protocol implementations.

Swift has a use case here, as it needs to put/get objects atomically 
and would benefit from a single API rather than ploughing through 
multiple ones and ensuring atomicity using renames (again stated by 
Ben in the other mail). BUT, we cannot assume the entire object is in 
hand before invoking the API (consider an object 1GB in size). So 
Swift would only be able to use this when it has an entire object, and 
would otherwise need optimizations like the ones Ben suggested for FUSE.
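
For reference, the multi-call path I am referring to looks roughly like 
the sketch below with today's libgfapi (the put_object helper name is 
mine, the glfs_* calls are the existing ones; error handling and cleanup 
are trimmed):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <glusterfs/api/glfs.h>

/* Today's "atomic put" pattern: write the object under a temporary
   name, then rename it into place.  This is the sequence a single
   put-object API could collapse. */
static int
put_object (glfs_t *fs, const char *name, const void *buf, size_t len)
{
        char tmp[PATH_MAX];
        snprintf (tmp, sizeof (tmp), "%s.tmp", name);

        glfs_fd_t *fd = glfs_creat (fs, tmp, O_WRONLY | O_TRUNC, 0644);
        if (!fd)
                return -1;

        glfs_write (fd, buf, len, 0);        /* whole object in one write   */
        glfs_close (fd);

        return glfs_rename (fs, tmp, name);  /* atomic switch to final name */
}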

Hence the question: who are the consumers of this API?

- Do these interfaces create the files if absent on writes?

IOW, is this only for existing objects/files, or does it extend the use 
case to creating and writing files as objects?
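
Purely to make that question concrete, and continuing the made-up sketch 
from above, create-on-write could be expressed through O_CREAT-style 
flags, for example:

/* Hypothetical, not an existing libgfapi call.  flags/mode would decide
   whether missing objects are created or the write fails with ENOENT. */
int glfs_objects_put (glfs_t *fs, const char *store_path,
                      struct glfs_object_req *objects, int count,
                      int flags,     /* e.g. O_CREAT | O_EXCL, or 0 for
                                        existing objects only            */
                      mode_t mode);  /* used only when creating          */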

>
> The following is what I was thinking - please feel free to correct me
> or guide me if someone has already done some ground work on this.
>
> For read, multiple objects can be provided and they should be
> separated for read from appropriate brick based on the DHT flag - this
> will help avoid multiple lookups from all servers. In the absence of
> DHT they would be sent to all but only the ones that contain the
> object respond (it's more like a multiple file lookup request).

The above section is, for me, a bit sketchy on details, but the 
following questions do crop up:
- What do you mean by "separated for read from appropriate brick based 
on the DHT flag"?

If the *objects* array is a list of names of objects/files under 
*store_path*, we still need to determine which DHT subvolume these 
exist on (which could be AFR subvols) and then read from the right 
subvol. This information could already be cached on the inode in the 
client stack by DHT, which would avoid the lookup anyway; if not, these 
need to be looked up and found in the appropriate subvols. What is it 
that we are trying to avoid or optimize here, and how?
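
Put differently, even with a batched call the client stack would still 
have to do something like the following per object. The helper names 
below (subvol_for_name, lookup_and_cache, add_to_batch) and the 
parent_inode/batches variables are made up and only stand in for DHT's 
hash/layout resolution:

/* Conceptual sketch only, not real code. */
for (int i = 0; i < count; i++) {
        xlator_t *subvol = subvol_for_name (parent_inode, objects[i].name);
        if (!subvol)
                /* nothing cached: a lookup is still needed */
                subvol = lookup_and_cache (parent_inode, objects[i].name);

        add_to_batch (batches, subvol, &objects[i]);
}
/* ...then issue one batched read per subvolume (AFR picks a replica). */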

>
> For write, same as the case of read, complete object writes (no
> partial updates, file offsets etc.)
>
> For delete, most of the lookup and batching logic remains the same.
>
> I can help with testing, documentation or benchmarks if someone has
> already done some work.

There was a mention of writing a feature page for this enhancement; I 
would suggest doing that, even if premature, so that the details are 
better elaborated and understood (by me, at least).

HTH,
Shyam

