[Gluster-devel] Performance experiments with io-stats translator

Krutika Dhananjay kdhananj at redhat.com
Tue Jun 20 06:09:01 UTC 2017


Some updates on this topic:

I ran fio again, this time with Raghavendra's epoll-rearm patch @
https://review.gluster.org/17391

IOPS increased to ~50K (from 38K).
Avg READ latency as seen by the io-stats translator that sits above
client-io-threads came down to 963us (from 1666us).
∆ (2,3), i.e. the gap between the avg READ latency seen just above
protocol/client-0 on the client and just below protocol/server on the brick,
is down to 804us (from 1453us).
Disk utilization, however, didn't improve.
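
For anyone who wants to try the same thing, the change can be pulled straight
from Gerrit; a minimal sketch, assuming an anonymous HTTP fetch of the glusterfs
project, with the patchset number (the trailing /1) purely illustrative:

    # fetch change 17391 and apply it on top of the local tree
    git fetch https://review.gluster.org/glusterfs refs/changes/91/17391/1 && \
        git cherry-pick FETCH_HEAD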



On Sat, Jun 10, 2017 at 12:47 AM, Manoj Pillai <mpillai at redhat.com> wrote:

> So comparing the key latency, ∆ (2,3), in the two cases:
>
> iodepth=1: 171 us
> iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
> wonder if that relation roughly holds up for other values of iodepth).
>
> This data doesn't conclusively establish that the problem is in gluster.
> You'd see similar results if the network were saturated, as Vijay
> suggested. But from what I remember of this test, the throughput here is
> far too low for that to be the case.
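>
> A quick way to rule the network in or out is to watch NIC throughput on the
> clients while the job runs; a sketch, assuming sysstat is installed (older
> versions report rxbyt/s and txbyt/s instead of rxkB/s and txkB/s):
>
>     # per-second NIC stats during the fio run; compare rxkB/s + txkB/s on the
>     # interface carrying gluster traffic against the link's rated bandwidth
>     sar -n DEV 1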
>
> -- Manoj
>
>
> On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay <kdhananj at redhat.com>
> wrote:
>
>> Indeed the latency on the client side dropped with iodepth=1. :)
>> I ran the test twice and the results were consistent.
>>
>> Here are the exact numbers:
>>
>> *Translator Position*                       *Avg Latency of READ fop as seen by this translator*
>>
>> 1. parent of client-io-threads                437us
>>
>> ∆ (1,2) = 69us
>>
>> 2. parent of protocol/client-0                368us
>>
>> ∆ (2,3) = 171us
>>
>> ----------------- end of client stack ---------------------
>> ----------------- beginning of brick stack --------------
>>
>> 3. child of protocol/server                   197us
>>
>> ∆ (3,4) = 4us
>>
>> 4. parent of io-threads                        193us
>>
>> ∆ (4,5) = 32us
>>
>> 5. child of io-threads                        161us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix                   150us
>> ...
>> ---------------- end of brick stack ------------------------
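>>
>> The ∆ values are just the differences between consecutive rows; here's a
>> quick Python sketch of how they can be derived, with the numbers above
>> hard-coded for illustration:
>>
>> # avg READ latency per io-stats position, in microseconds, top of stack first
>> lat = [
>>     ("parent of client-io-threads", 437),
>>     ("parent of protocol/client-0", 368),
>>     ("child of protocol/server", 197),
>>     ("parent of io-threads", 193),
>>     ("child of io-threads", 161),
>>     ("parent of storage/posix", 150),
>> ]
>>
>> # each delta is the latency seen at one position minus the one just below it
>> for (pos_a, us_a), (pos_b, us_b) in zip(lat, lat[1:]):
>>     print(f"delta({pos_a} -> {pos_b}) = {us_a - us_b}us")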
>>
>> Will continue reading code and get back when I find something concrete.
>>
>> -Krutika
>>
>>
>> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai <mpillai at redhat.com> wrote:
>>
>>> Thanks. So I was suggesting a repeat of the test, but this time with
>>> iodepth=1 in the fio job. If reducing the number of concurrent requests
>>> drastically reduces the high latency you're seeing from the client side,
>>> that would strengthen the hypothesis that serialization/contention among
>>> concurrent requests at the n/w layers is the root cause here.
>>>
>>> -- Manoj
>>>
>>>
>>> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay <kdhananj at redhat.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is what my job file contains:
>>>>
>>>> [global]
>>>> ioengine=libaio
>>>> #unified_rw_reporting=1
>>>> randrepeat=1
>>>> norandommap=1
>>>> group_reporting
>>>> direct=1
>>>> runtime=60
>>>> thread
>>>> size=16g
>>>>
>>>>
>>>> [workload]
>>>> bs=4k
>>>> rw=randread
>>>> iodepth=8
>>>> numjobs=1
>>>> file_service_type=random
>>>> filename=/perf5/iotest/fio_5
>>>> filename=/perf6/iotest/fio_6
>>>> filename=/perf7/iotest/fio_7
>>>> filename=/perf8/iotest/fio_8
>>>>
>>>> I have 3 VMs reading from one mount, and each of these VMs is running
>>>> the above job in parallel.
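>>>>
>>>> So in terms of concurrency: with iodepth=8 and numjobs=1, each client keeps
>>>> about 8 READs in flight, i.e. roughly 8 x 1 x 3 = 24 outstanding requests
>>>> against the volume across the 3 VMs at any point in the run.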
>>>>
>>>> -Krutika
>>>>
>>>> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai <mpillai at redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay <kdhananj at redhat.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As part of identifying performance bottlenecks within the gluster stack
>>>>>> for the VM image store use-case, I loaded io-stats at multiple points on
>>>>>> the client and brick stacks and ran the randrd test using fio from within
>>>>>> the hosted VMs in parallel.
>>>>>>
>>>>>> Before I get to the results, a little bit about the configuration ...
>>>>>>
>>>>>> 3-node cluster; 1x3 plain replicate volume with the group virt settings
>>>>>> and direct-io.
>>>>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>>>>> served from the replica that is local to the client).
>>>>>>
>>>>>> io-stats was loaded at the following places:
>>>>>> On the client stack: Above client-io-threads and above
>>>>>> protocol/client-0 (the first child of AFR).
>>>>>> On the brick stack: Below protocol/server, above and below io-threads
>>>>>> and just above storage/posix.
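>>>>>>
>>>>>> Loading io-stats at an arbitrary point just means splicing a debug/io-stats
>>>>>> volume into the volfile at that spot. A minimal sketch for the instance above
>>>>>> protocol/client-0, with the volume names purely illustrative and assuming the
>>>>>> usual latency-measurement/count-fop-hits options:
>>>>>>
>>>>>> volume iostats-above-client-0
>>>>>>     type debug/io-stats
>>>>>>     option latency-measurement on
>>>>>>     option count-fop-hits on
>>>>>>     subvolumes testvol-client-0
>>>>>> end-volume
>>>>>>
>>>>>> AFR's subvolumes line then points at iostats-above-client-0 instead of
>>>>>> testvol-client-0.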
>>>>>>
>>>>>> Based on a 60-second run of the randrd test and subsequent analysis of
>>>>>> the stats dumped by the individual io-stats instances (see the note after
>>>>>> the table on how such dumps can be triggered), the following is what I
>>>>>> found:
>>>>>>
>>>>>> *Translator Position*                       *Avg Latency of READ fop as seen by this translator*
>>>>>>
>>>>>> 1. parent of client-io-threads                1666us
>>>>>>
>>>>>> ∆ (1,2) = 50us
>>>>>>
>>>>>> 2. parent of protocol/client-0                1616us
>>>>>>
>>>>>> ∆ (2,3) = 1453us
>>>>>>
>>>>>> ----------------- end of client stack ---------------------
>>>>>> ----------------- beginning of brick stack -----------
>>>>>>
>>>>>> 3. child of protocol/server                   163us
>>>>>>
>>>>>> ∆ (3,4) = 7us
>>>>>>
>>>>>> 4. parent of io-threads                        156us
>>>>>>
>>>>>> ∆ (4,5) = 20us
>>>>>>
>>>>>> 5. child of io-threads                        136us
>>>>>>
>>>>>> ∆ (5,6) = 11us
>>>>>>
>>>>>> 6. parent of storage/posix                   125us
>>>>>> ...
>>>>>> ---------------- end of brick stack ------------------------
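>>>>>>
>>>>>> One way to pull the per-instance stats during or after a run is to trigger
>>>>>> an io-stats dump through the trusted.io-stats-dump virtual xattr on the
>>>>>> mount point; a sketch, with the output name and mount path purely
>>>>>> illustrative:
>>>>>>
>>>>>> setfattr -n trusted.io-stats-dump -v iostats-dump /mnt/testvol
>>>>>>
>>>>>> Each io-stats instance that sees the call writes out its fop counts and
>>>>>> latency stats.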
>>>>>>
>>>>>> So it seems like the biggest bottleneck here is some combination of the
>>>>>> network, epoll and the rpc layer?
>>>>>> I must admit I am no expert on networks, but I'm assuming that if the
>>>>>> client is reading from the local brick, then even the latency contribution
>>>>>> from the actual network won't be much, in which case the bulk of the
>>>>>> latency is coming from epoll, the rpc layer, etc. at both the client and
>>>>>> brick ends? Please correct me if I'm wrong.
>>>>>>
>>>>>> I will, of course, do some more runs and confirm if the pattern is
>>>>>> consistent.
>>>>>>
>>>>>> -Krutika
>>>>>>
>>>>>>
>>>>> Really interesting numbers! How many concurrent requests are in flight
>>>>> in this test? Could you post the fio job? I'm wondering if/how these
>>>>> latency numbers change if you reduce the number of concurrent requests.
>>>>>
>>>>> -- Manoj
>>>>>
>>>>>
>>>>
>>>
>>
>