[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

Pranith Kumar Karampuri pkarampu at redhat.com
Tue Feb 2 14:22:25 UTC 2016



On 02/02/2016 06:22 PM, Jeff Darcy wrote:
>>        Background: the quick-read and open-behind xlators were developed to
>> help small-file read workloads (e.g. an Apache webserver, tar) by returning
>> the file's data in the lookup FOP itself. When a lookup FOP is executed,
>> GF_CONTENT_KEY is added to xdata along with a max-length, and the posix
>> xlator reads the file and fills the data into the xdata response whenever
>> this key is present and the file size is less than the given max-length.
>> So when we tar something like a kernel tree full of small files, a profile
>> of the bricks shows only lookups; OPEN and READ fops are not sent over the
>> network at all.
>>
>>        With dht2, because the data lives on a different (data) cluster, we
>> can't return the data in lookup. Shyam was telling me that opens are also
>> sent to the metadata cluster. That would take perf in this use case back to
>> where it was before these two features were introduced, i.e. 1/3 of current
>> perf (lookup vs. lookup+open+read).
> Is "1/3 of current perf" based on actual measurements?  My understanding
> was that the translators in question exist to send requests *in parallel*
> with the original lookup stream.  That means it might be 3x the messages,
> but it will only be 1/3 the performance if the network is saturated.
> Also, the lookup is not guaranteed to be only one message.  It might be
> as many as N (the number of bricks), so by the reasoning above the
> performance would only drop to N/(N+2).  I think the real situation is a
> bit more complicated - and less dire - than you suggest.

From what I heard, when quick-read (now split into open-behind and
quick-read) was introduced, webserver users reported a 300% to 400% perf
improvement. We should definitely measure this once we have enough code to
do so; I am just giving a heads-up.
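
For anyone not familiar with the mechanism, this is roughly what the client
side does today. It is only a minimal sketch, not the actual quick-read code:
the wrapper function and the 64KB cap are made up for illustration, while
GF_CONTENT_KEY and the dict calls are the usual libglusterfs ones (header
paths depend on the tree):

    /* Minimal sketch: ask posix to embed the file's content in the lookup
     * response. GF_CONTENT_KEY and the dict API are real; the function
     * name and the size cap below are illustrative only. */

    #include "glusterfs.h"   /* GF_CONTENT_KEY */
    #include "dict.h"        /* dict_t, dict_new, dict_set_uint64, dict_unref */

    #define QR_MAX_INLINE_SIZE (64 * 1024)   /* illustrative cap */

    static dict_t *
    build_lookup_xdata (void)
    {
            dict_t *xdata = dict_new ();

            if (!xdata)
                    return NULL;

            /* posix fills the file data into its lookup response as long
             * as the file is smaller than this value. */
            if (dict_set_uint64 (xdata, GF_CONTENT_KEY, QR_MAX_INLINE_SIZE)) {
                    dict_unref (xdata);
                    return NULL;
            }

            return xdata;
    }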

Having said that, for 'tar' I think we can most probably do a better job in
dht2, because even after readdirp a nameless lookup still comes. If it
carries GF_CONTENT_KEY we should send it to the data cluster directly (a
rough sketch follows). For the webserver use case I don't have any ideas
yet.
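
Something along these lines; purely hypothetical, since dht2 is not written
yet. dht2_private_t, its mds/ds subvolume fields and dht2_lookup_cbk are
invented names here; dict_get() and STACK_WIND() are the usual primitives:

    /* Hypothetical routing sketch for dht2's lookup: if the caller asked
     * for inline content, wind the nameless lookup to the data cluster
     * instead of the metadata cluster. None of the dht2 structures below
     * exist yet. */

    int
    dht2_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
                 dict_t *xdata)
    {
            dht2_private_t *priv   = this->private;      /* invented */
            xlator_t       *target = priv->mds_subvol;   /* metadata cluster */

            /* A lookup carrying GF_CONTENT_KEY (e.g. tar after readdirp)
             * wants the file data, which only the data cluster has. */
            if (xdata && dict_get (xdata, GF_CONTENT_KEY))
                    target = priv->ds_subvol;            /* data cluster */

            STACK_WIND (frame, dht2_lookup_cbk, target, target->fops->lookup,
                        loc, xdata);
            return 0;
    }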

At least on my laptop this is what I saw; on a setup with separate client
and server machines the situation could be worse. This is a distribute
volume with one brick.

root at localhost - /mnt/d1
19:42:52 :) ⚡ time tar cf a.tgz a

real    0m6.987s
user    0m0.089s
sys    0m0.481s

root at localhost - /mnt/d1
19:43:22 :) ⚡ cd

root at localhost - ~
19:43:25 :) ⚡ umount /mnt/d1

root at localhost - ~
19:43:27 :) ⚡ gluster volume set d1 open-behind off
volume set: success

root at localhost - ~
19:43:47 :) ⚡ gluster volume set d1 quick-read off
volume set: success

root at localhost - ~
19:44:03 :( ⚡ gluster volume stop d1
Stopping volume will make its data inaccessible. Do you want to 
continue? (y/n) y
volume stop: d1: success

root at localhost - ~
19:44:09 :) ⚡ gluster volume start d1
volume start: d1: success

root at localhost - ~
19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1

root at localhost - ~
19:44:29 :) ⚡ cd /mnt/d1

root at localhost - /mnt/d1
19:44:30 :) ⚡ time tar cf b.tgz a

real    0m12.176s
user    0m0.098s
sys    0m0.582s

Pranith
>
>> I suggest that we send some fop to the data cluster at the time of open,
>> and change quick-read to cache this data on open (if it doesn't already);
>> then we can reduce the perf hit to 1/2 of current perf, i.e. lookup+open.
> At first glance, it seems pretty simple to do something like this, and
> pretty obvious that we should.  The tricky question is: where should we
> send that other op, before lookup has told us where the partition
> containing that file is?  If there's some reasonable guess we can make,
> then sending an open+read in parallel with the lookup will be helpful.
> If not, then it will probably be a waste of time and network resources.
> Shyam, is enough of this information being cached *on the clients* to
> make this effective?
Pranith

