[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

Pranith Kumar Karampuri pkarampu at redhat.com
Wed Feb 3 06:43:37 UTC 2016

On 02/03/2016 11:49 AM, Pranith Kumar Karampuri wrote:
> On 02/03/2016 09:20 AM, Shyam wrote:
>> On 02/02/2016 06:22 PM, Jeff Darcy wrote:
>>>>        Background: The quick-read and open-behind xlators were
>>>> developed to help small-file read workloads (Apache web server, tar,
>>>> etc.) by fetching the file's data in the lookup FOP itself. When a
>>>> lookup FOP is executed, GF_CONTENT_KEY is added to xdata with a
>>>> max-length. If the key is present, the posix xlator reads the file
>>>> and fills the data into the xdata response, as long as the file size
>>>> is less than the max-length given in the xdata. So when we tar
>>>> something like a kernel tree with small files, the brick profile
>>>> shows nothing but lookups. OPEN + READ fops are not sent over the
>>>> network at all.
>>>>        With dht2, the data lives on a different cluster, so we can't
>>>> get the data in lookup. Shyam was telling me that opens are also
>>>> sent to the metadata cluster. That will put perf for this use case
>>>> back to where it was before these two features were introduced,
>>>> i.e. 1/3 of current perf (lookup vs. lookup+open+read).
>> This is interesting thanks for the heads up.
>>> Is "1/3 of current perf" based on actual measurements?  My
>>> understanding was that the translators in question exist to send
>>> requests *in parallel* with the original lookup stream.  That means
>>> it might be 3x the messages, but it will only be 1/3 the performance
>>> if the network is saturated.  Also, the lookup is not guaranteed to
>>> be only one message.  It might be as many as N (the number of
>>> bricks), so by the reasoning above the performance would only drop
>>> to N/(N+2).  I think the real situation is a bit more complicated -
>>> and less dire - than you suggest.
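
[Editor's sketch] The arithmetic in the paragraph above can be written out as follows. This is only a rough model, assuming a saturated network where every message costs the same and throughput is inversely proportional to messages per file. The function name `relative_throughput` is made up for illustration and nothing here is measured data.

```python
from fractions import Fraction


def relative_throughput(lookup_msgs: int, extra_msgs: int = 2) -> Fraction:
    """Throughput relative to the lookup-only case.

    Assumes a saturated network and equal cost per message, so the ratio is
    simply (messages before) / (messages after). extra_msgs=2 stands for the
    additional open and read.
    """
    if lookup_msgs < 1:
        raise ValueError("a lookup needs at least one message")
    return Fraction(lookup_msgs, lookup_msgs + extra_msgs)


# N = 1 reproduces the "1/3 of current perf" figure; larger N is less dire.
for n in (1, 2, 4, 8):
    ratio = relative_throughput(n)
    print(f"N={n}: {ratio} ({float(ratio):.2f})")
```

With one lookup message the ratio is 1/3, while a lookup that fans out to 4 bricks gives 4/6, i.e. 2/3, under the same assumptions.
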
>>>> I suggest that we send some fop to the data cluster at the time of
>>>> open and change quick-read to cache this data on open (if not
>>>> already cached); then we can reduce the perf hit to 1/2 of current
>>>> perf, i.e. lookup+open.
>>> At first glance, it seems pretty simple to do something like this, and
>>> pretty obvious that we should.  The tricky question is: where should we
>>> send that other op, before lookup has told us where the partition
>>> containing that file is?  If there's some reasonable guess we can make,
>>> then sending an open+read in parallel with the lookup will be helpful.
>>> If not, then it will probably be a waste of time and network resources.
>>> Shyam, is enough of this information being cached *on the clients* to
>>> make this effective?
>> The file data would be located based on its GFID, so before the
>> *first* lookup/stat for a file, there is no way to know its GFID.
>> NOTE: Instead of a name hash, the GFID hash is used to get immunity
>> against renames and the like, as a name hash could change the
>> location information for the file (among other reasons).
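
[Editor's sketch] To illustrate the NOTE above: the minimal example below contrasts placement by GFID hash with placement by name hash. It is not dht2's actual hashing or layout code. The SHA-256-modulo scheme, the function names, and the idea of a flat list of `num_subvols` data subvolumes are all assumptions made purely for illustration.

```python
import hashlib
import uuid


def _bucket(data: bytes, num_subvols: int) -> int:
    """Map arbitrary bytes to a subvolume index (hypothetical scheme)."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % num_subvols


def placement_by_gfid(gfid: uuid.UUID, num_subvols: int) -> int:
    # The file name is not an input, so a rename cannot move the data.
    return _bucket(gfid.bytes, num_subvols)


def placement_by_name(name: str, num_subvols: int) -> int:
    # The name is the only input, so a rename may change the location.
    return _bucket(name.encode("utf-8"), num_subvols)


gfid = uuid.uuid4()  # stands in for the GFID assigned at create time
before = placement_by_gfid(gfid, 8)
# ... file is renamed from "old.txt" to "new.txt"; its GFID is unchanged ...
after = placement_by_gfid(gfid, 8)
print(before == after)  # placement by GFID is unaffected by the rename
print(placement_by_name("old.txt", 8), placement_by_name("new.txt", 8))
```

The flip side, as the thread notes, is that the client cannot compute the GFID-based location until a lookup (or readdirp) has returned the GFID.
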
>> The open+read can be done as a single FOP,
>>   - open for a read-only case can do access checking on the client to
>>     allow the FOP to proceed to the DS without hitting the MDS for an
>>     open token
>> The client side cache is important from this and other such 
>> perspectives. It should also leverage upcall infra to keep the cache 
>> loosely coherent.
>> One thing to note here: for the client to do a lookup (where the
>> file name must be known beforehand), either a readdir(p) has to have
>> happened, or the client knows the name already (say, names generated
>> by the application). For the former (readdir case), there is enough
>> information on the client to not need a lookup, but rather just do
>> the open+read on the DS. For the latter, the first lookup cannot be
>> avoided, degrading this to a lookup+(open+read).
>> Some further tricks can be done to do readdir prefetching on such 
>> workloads, as the MDS runs on a DB (eventually), piggybacking more 
>> entries than requested on a lookup. I would possibly leave that for 
>> later, based on performance numbers in the small file area.
> I strongly suggest that we don't postpone this, as I think this is a
> solved problem. http://www.ietf.org/rfc/rfc4122.txt section 4.3 may
> be of help here, i.e. create a UUID from a namespace and a name
> string. So we can use the pgfid as the namespace and the filename as
> the string. I understand that we will get into 2 hops if the file is
> renamed, but it is the best we can do right now. We can take help
> from the crypto team at Red Hat to make sure we do the right thing.
> If we get this implementation into dht2 after the code is released,
> all the files created with the old gfid-generation will work with
> half the possible perf.
Gah! ignore, it will lead to gfid collisions :-/
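
[Editor's sketch] The RFC 4122 section 4.3 idea quoted above, and one plausible reading of the collision concern, can be illustrated with Python's standard `uuid.uuid5` (the SHA-1 name-based variant). The message does not spell out the collision scenario, so the rename-then-recreate case below is an assumption. The helper name `name_based_gfid` and the parent GFID value are illustrative only.

```python
import uuid


def name_based_gfid(pgfid: uuid.UUID, filename: str) -> uuid.UUID:
    """Derive a GFID from (parent GFID, file name), RFC 4122 version 5.

    The parent GFID plays the role of the namespace and the file name the
    role of the name string. Hypothetical helper, not Gluster code.
    """
    return uuid.uuid5(pgfid, filename)


# Example parent directory GFID (illustrative value).
parent = uuid.UUID("00000000-0000-0000-0000-000000000001")

# A client could compute this without a lookup, which is the attraction.
first = name_based_gfid(parent, "a.txt")

# Assumed collision scenario: "a.txt" is renamed to "b.txt" and keeps its
# GFID, then a brand-new "a.txt" is created in the same directory. The new
# file derives exactly the same GFID as the renamed one still carries.
recreated = name_based_gfid(parent, "a.txt")
print(first == recreated)  # two distinct files, one GFID
```

Because the derivation is deterministic by design, any two files that ever occupy the same (pgfid, name) pair would share a GFID under this scheme.
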

> Pranith
>> Shyam
