[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

Wed Feb 3 08:41:36 UTC 2016

On 02/03/2016 09:20 AM, Shyam wrote:
> On 02/02/2016 06:22 PM, Jeff Darcy wrote:
>>>        Background: Quick-read + open-behind xlators are developed to help
>>> in small file workload reads like apache webserver, tar etc to get the
>>> data of the file in lookup FOP itself. What happens is, when a lookup
>>> FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
>>> posix xlator reads the file and fills the data in xdata response if this
>>> key is present as long as the file-size is less than max-length given in
>>> the xdata. So when we do a tar of something like a kernel tree with
>>> small files, if we look at profile of the bricks all we see are lookups.
>>> OPEN + READ fops will not be sent at all over the network.
>>>
>>>        With dht2, because data is present on a different cluster, we can't
>>> get the data in lookup. Shyam was telling me that opens are also sent to
>>> the metadata cluster. That will take perf in this use case back to where it
>>> was before introducing these two features, i.e. 1/3 of current perf
>>> (lookup vs lookup+open+read).
>
> This is interesting thanks for the heads up.
>
>>
>> Is "1/3 of current perf" based on actual measurements?  My understanding
>> was that the translators in question exist to send requests *in parallel*
>> with the original lookup stream.  That means it might be 3x the messages,
>> but it will only be 1/3 the performance if the network is saturated.
>> Also, the lookup is not guaranteed to be only one message.  It might be
>> as many as N (the number of bricks), so by the reasoning above the
>> performance would only drop to N/(N+2).  I think the real situation is a
>> bit more complicated - and less dire - than you suggest.
>>
>>> I suggest that we send some fop at the
>>> time of open to data cluster and change quick-read to cache this data on
>>> open (if not already) then we can reduce the perf hit to 1/2 of current
>>> perf, i.e. lookup+open.
>>
>> At first glance, it seems pretty simple to do something like this, and
>> pretty obvious that we should.  The tricky question is: where should we
>> send that other op, before lookup has told us where the partition
>> containing that file is?  If there's some reasonable guess we can make,
>> then sending an open+read in parallel with the lookup will be helpful.
>> If not, then it will probably be a waste of time and network resources.
>> Shyam, is enough of this information being cached *on the clients* to
>> make this effective?
>>
>
> The file data would be located based on its GFID, so before the *first*
> lookup/stat for a file, there is no way to know its GFID.
> NOTE: Instead of a name hash the GFID hash is used, to get immunity
> against renames and the like, as a name hash could change the location
> information for the file (among other reasons).

Another way of achieving the same result, when the GFID of the file is
known (from a readdir), is to wind the lookup to the MDS and a read
(sized to the quick-read max-length) to the DS. The lookup would be
answered as soon as the MDS responds, and the DS response is cached for
the subsequent open+read case. So on the wire we would have a fan-out of
2 FOPs, but still satisfy the quick-read requirements.
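
To make the fan-out above concrete, here is a rough sketch in Python
(not Gluster code). Everything in it is a made-up stand-in: MDS and DS
are plain dicts, and mds_lookup, ds_read, Client and MAX_LENGTH are
hypothetical names. It only shows the ordering: both FOPs are wound at
lookup time, the lookup returns on the MDS reply alone, and the DS reply
is kept for the later open+read.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the metadata server (MDS) and data server (DS).
MDS = {"gfid-1": {"size": 11, "mode": 0o644}}
DS = {"gfid-1": b"hello world"}

MAX_LENGTH = 64 * 1024  # assumed quick-read size limit, not a real default


def mds_lookup(gfid):
    """Return the stat-like metadata for gfid from the MDS."""
    return MDS[gfid]


def ds_read(gfid, size):
    """Return up to `size` bytes of file data from the DS."""
    return DS[gfid][:size]


class Client:
    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending = {}  # gfid -> Future holding the DS read

    def lookup(self, gfid):
        # Fan out both FOPs, but wait only for the MDS reply.
        read_future = self.pool.submit(ds_read, gfid, MAX_LENGTH)
        meta_future = self.pool.submit(mds_lookup, gfid)
        self.pending[gfid] = read_future
        return meta_future.result()

    def open_and_read(self, gfid):
        # Served from the cached DS reply when the lookup prefetched it.
        future = self.pending.pop(gfid, None)
        if future is None:
            return ds_read(gfid, MAX_LENGTH)
        return future.result()
```

A real implementation would also have to bound and invalidate the cached
replies (for example through upcall), which the sketch ignores.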

I would assume the above resolves the problem posted. Are there cases
where we do not know the GFID of the file, i.e. no readdir was performed
and the client already knows the name of the file it wants to operate on?
Do we have traces of the webserver workload to see if it generates names
on the fly or does a readdir prior to that?
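
The difference between the two cases is in how many sequential steps
the client has to wait for. A small illustrative sketch (fops_on_wire is
a hypothetical helper, not an existing function) that lists the FOPs per
step, where FOPs within a step can be wound in parallel:

```python
def fops_on_wire(gfid_known):
    """Return the FOPs as a list of sequential steps; each step is a list
    of FOPs that can be wound in parallel."""
    if gfid_known:
        # GFID learnt from an earlier readdir: fan out to MDS and DS at once.
        return [["lookup -> MDS", "read -> DS"]]
    # Name only: the lookup must return the GFID before the DS can be located.
    return [["lookup -> MDS"], ["open+read -> DS"]]


for known in (True, False):
    steps = fops_on_wire(known)
    print(known, len(steps), sum(len(s) for s in steps))
```

Both cases put 2 FOPs on the wire, but only the name-only case needs two
sequential steps.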

>
> The open+read can be done as a single FOP,
>    - open for a read only case can do access checking on the client to
> allow the FOP to proceed to the DS without hitting the MDS for an open
> token
>
> The client side cache is important from this and other such
> perspectives. It should also leverage upcall infra to keep the cache
> loosely coherent.
>
> One thing to note here would be, for the client to do a lookup (where
> the file name should be known beforehand), either a readdir(p) has to
> have happened, or the client knows the name already (say application
> generated names). For the former (readdir case), there is enough
> information on the client to not need a lookup, but rather just do the
> open+read on the DS. For the latter the first lookup cannot be avoided,
> degrading this to a lookup+(open+read).
>
> Some further tricks can be done to do readdir prefetching on such
> workloads, as the MDS runs on a DB (eventually), piggybacking more
> entries than requested on a lookup. I would possibly leave that for
> later, based on performance numbers in the small file area.
>
> Shyam