[Gluster-devel] Feature help
Shyam
srangana at redhat.com
Mon Nov 10 19:11:47 UTC 2014
On 11/01/2014 10:20 AM, Rudra Siva wrote:
> Hi,
>
> I'm very interested in helping with this feature by way of development
> help, testing and or benchmarking.
>
> Features/Feature Smallfile Perf
>
> One of the things I was looking into was possibility of adding a few
> API calls to libgfapi to help allow reading and writing multiple small
> files as objects - just as librados does for ceph - cutting out FUSE
> and other semantics that tend to be overheads for really small files.
> I don't know what else I will have to add for libgfapi to support
> this.
The response below is based on reading this mail and the other mail
you sent titled "libgfapi object api", which I believe expands on the
actual APIs you are thinking of. (The following commentary is meant to
glean more information, as this is something that can help small file
performance, and any confusion could be the result of my own
misunderstanding :) )
- Who are the consumers of such an API?
The way I see it, FUSE does not have a direct way to use this
enhancement, unless we think of ways, like the ones Ben proposed, to
detect and defer small file creates.
Neither do the NFS or SMB protocol implementations.
Swift has a use case here, as it needs to put/get objects atomically
and would benefit from a single API rather than ploughing through
multiple calls and ensuring atomicity using renames (again stated by
Ben in the other mail; a sketch of that multi-call sequence is below).
BUT, we cannot always have the entire object in hand before invoking
the API (consider an object 1GB in size). So Swift would have to use
this only when it has a complete object, and otherwise fall back to
optimizations like the ones Ben suggested for FUSE.
Hence the question, who are the consumers of this API?
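To make the rename point concrete, here is a minimal, untested sketch
of the multi-call sequence an object store has to issue today through
the existing libgfapi calls to put one small object atomically (paths,
flags and error handling are illustrative only); this is roughly what
a single put-object API would collapse into one call:

#include <sys/types.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

/* Sketch only: write the whole object to a temporary name, then
 * rename it into place; the rename is what provides atomicity today. */
static int put_object_atomic(glfs_t *fs, const char *tmppath,
                             const char *path, const void *buf,
                             size_t len)
{
        glfs_fd_t *fd = glfs_creat(fs, tmppath, O_WRONLY | O_TRUNC, 0644);
        if (!fd)
                return -1;

        if (glfs_write(fd, buf, len, 0) != (ssize_t)len) {
                glfs_close(fd);
                glfs_unlink(fs, tmppath);
                return -1;
        }
        glfs_close(fd);

        /* atomic publish of the complete object */
        return glfs_rename(fs, tmppath, path);
}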
- Do these interfaces create the files if absent on writes?
IOW, is this for existing objects/files, or does it extend the use
case into creating and writing files as objects?
>
> The following is what I was thinking - please feel free to correct me
> or guide me if someone has already done some ground work on this.
>
> For read, multiple objects can be provided and they should be
> separated for read from appropriate brick based on the DHT flag - this
> will help avoid multiple lookups from all servers. In the absence of
> DHT they would be sent to all but only the ones that contain the
> object respond (it's more like a multiple file lookup request).
The above section is sketchy on details for me, but the following
questions do crop up:
- What do you mean by "separated for read from appropriate brick based
on the DHT flag"?
If the *objects* array is a list of names of objects/files under
*store_path*, we still need to determine which DHT subvolume each of
these exists on (which could in turn be an AFR subvol) and then read
from the right subvol. This information may already be cached on the
inode in the client-side DHT, which would avoid the lookup anyway; if
not, these objects need to be looked up and found in the appropriate
subvols. What is it that we are trying to avoid or optimize here, and
how?
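For my own clarity, this is roughly how I am reading the proposed
batched read. Nothing like the following exists in libgfapi today;
both glfs_object_mget and struct glfs_object_iov are names I am making
up purely to illustrate the shape of the call:

#include <sys/types.h>
#include <glusterfs/api/glfs.h>

/* Hypothetical only: a batched whole-object read, one entry per
 * object under a common store_path; not an existing libgfapi API. */
struct glfs_object_iov {
        const char *name;  /* object/file name under store_path      */
        void       *buf;   /* caller-supplied buffer for the payload */
        size_t      len;   /* size of the buffer                     */
        ssize_t     ret;   /* bytes read, or -errno, for this object */
};

/* An implementation would still need to resolve each name to its DHT
 * subvolume (from the cached layout/inode if available, else via
 * lookup) before fanning the reads out, which is the part I am asking
 * about above. */
int glfs_object_mget(glfs_t *fs, const char *store_path,
                     struct glfs_object_iov *iov, int count);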
>
> For write, same as the case of read, complete object writes (no
> partial updates, file offsets etc.)
>
> For delete, most of the lookup and batching logic remains the same.
>
> I can help with testing, documentation or benchmarks if someone has
> already done some work.
There was a mention of writing a feature page for this enhancement; I
would suggest doing that, even if it seems premature, so that the
details are better elaborated and understood (by me at least).
HTH,
Shyam