[Gluster-devel] Performance experiments with io-stats translator

Krutika Dhananjay kdhananj at redhat.com
Tue Jun 20 06:18:48 UTC 2017


Apologies, I pressed 'send' before I was done.

On Tue, Jun 20, 2017 at 11:39 AM, Krutika Dhananjay <kdhananj at redhat.com>
wrote:

> Some update on this topic:
>
> I ran fio again, this time with Raghavendra's epoll-rearm patch @
> https://review.gluster.org/17391
>
> The IOPS increased to ~50K (from 38K).
> Avg READ latency as seen by the io-stats translator that sits above
> client-io-threads came down to 963us (from 1666us).
> ∆ (2,3) is down to 804us.
> The disk utilization didn't improve.
>

From code reading, it appears there is some serialization between POLLIN,
POLLOUT and POLLERR event handling for a given socket, because they all
contend for socket_private->lock.

I discussed this with Raghavendra G.
(I think he already alluded to this point earlier in the thread.)
Let me make some quick-and-dirty changes to see whether removing this
serialization improves performance further, and I'll update the thread
accordingly.
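
To make the suspected pattern concrete, here is a minimal, purely
illustrative C sketch (not the actual rpc/socket.c code; names and
structure are simplified) of how a single per-socket mutex ends up
serializing all event handling for that socket:

#include <pthread.h>
#include <poll.h>

struct sock_priv {
        pthread_mutex_t lock;   /* stands in for socket_private->lock */
        /* ... connection state, in/out buffers, etc. ... */
};

static void handle_pollin(struct sock_priv *priv)  { /* read and dispatch replies */ }
static void handle_pollout(struct sock_priv *priv) { /* flush queued writes */ }
static void handle_pollerr(struct sock_priv *priv) { /* tear down the connection */ }

/* Called by an epoll worker for every event on this socket. Because
 * every branch runs under the same lock, a slow POLLIN handler delays
 * POLLOUT/POLLERR processing (and vice versa) for this socket. */
void socket_event_handler(struct sock_priv *priv, int events)
{
        pthread_mutex_lock(&priv->lock);
        if (events & POLLERR) {
                handle_pollerr(priv);
        } else {
                if (events & POLLIN)
                        handle_pollin(priv);
                if (events & POLLOUT)
                        handle_pollout(priv);
        }
        pthread_mutex_unlock(&priv->lock);
}

If the quick patch does nothing more than narrow the locked region (or split
the lock per event type) and the numbers improve, that would support the theory.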

-Krutika


>
>
> On Sat, Jun 10, 2017 at 12:47 AM, Manoj Pillai <mpillai at redhat.com> wrote:
>
>> So comparing the key latency, ∆ (2,3), in the two cases:
>>
>> iodepth=1: 171 us
>> iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
>> wonder if that relation roughly holds up for other values of iodepth).
>>
>> This data doesn't conclusively establish that the problem is in gluster.
>> You'd see similar results if the network were saturated, as Vijay
>> suggested. But from what I remember of this test, the throughput here is
>> far too low for that to be the case.
>>
>> -- Manoj
>>
>>
>> On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay <kdhananj at redhat.com>
>> wrote:
>>
>>> Indeed the latency on the client side dropped with iodepth=1. :)
>>> I ran the test twice and the results were consistent.
>>>
>>> Here are the exact numbers:
>>>
>>> Translator position                        Avg latency of READ fop as seen by this translator
>>>
>>> 1. parent of client-io-threads                437us
>>>
>>> ∆ (1,2) = 69us
>>>
>>> 2. parent of protocol/client-0                368us
>>>
>>> ∆ (2,3) = 171us
>>>
>>> ----------------- end of client stack ---------------------
>>> ----------------- beginning of brick stack --------------
>>>
>>> 3. child of protocol/server                   197us
>>>
>>> ∆ (3,4) = 4us
>>>
>>> 4. parent of io-threads                        193us
>>>
>>> ∆ (4,5) = 32us
>>>
>>> 5. child of io-threads                          161us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix                   150us
>>> ...
>>> ---------------- end of brick stack ------------------------
>>>
>>> I'll continue reading the code and get back when I find something concrete.
>>>
>>> -Krutika
>>>
>>>
>>> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai <mpillai at redhat.com>
>>> wrote:
>>>
>>>> Thanks. So I was suggesting a repeat of the test, but this time with
>>>> iodepth=1 in the fio job. If reducing the number of concurrent requests
>>>> drastically reduces the high latency you're seeing from the client side,
>>>> that would strengthen the hypothesis that serialization/contention among
>>>> concurrent requests at the network layers is the root cause here.
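>>>>
>>>> For example, keeping the [global] section of the job file you posted
>>>> exactly as it is, the only change would be the iodepth line in the
>>>> [workload] section:
>>>>
>>>> [workload]
>>>> bs=4k
>>>> rw=randread
>>>> iodepth=1
>>>> numjobs=1
>>>> file_service_type=random
>>>> filename=/perf5/iotest/fio_5
>>>> filename=/perf6/iotest/fio_6
>>>> filename=/perf7/iotest/fio_7
>>>> filename=/perf8/iotest/fio_8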
>>>>
>>>> -- Manoj
>>>>
>>>>
>>>> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay <kdhananj at redhat.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is what my job file contains:
>>>>>
>>>>> [global]
>>>>> ioengine=libaio
>>>>> #unified_rw_reporting=1
>>>>> randrepeat=1
>>>>> norandommap=1
>>>>> group_reporting
>>>>> direct=1
>>>>> runtime=60
>>>>> thread
>>>>> size=16g
>>>>>
>>>>>
>>>>> [workload]
>>>>> bs=4k
>>>>> rw=randread
>>>>> iodepth=8
>>>>> numjobs=1
>>>>> file_service_type=random
>>>>> filename=/perf5/iotest/fio_5
>>>>> filename=/perf6/iotest/fio_6
>>>>> filename=/perf7/iotest/fio_7
>>>>> filename=/perf8/iotest/fio_8
>>>>>
>>>>> I have 3 VMs reading from one mount, and each of these VMs is running
>>>>> the above job in parallel.
>>>>>
>>>>> -Krutika
>>>>>
>>>>> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai <mpillai at redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay <
>>>>>> kdhananj at redhat.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> As part of identifying performance bottlenecks within the gluster stack
>>>>>>> for the VM image store use case, I loaded io-stats at multiple points on
>>>>>>> the client and brick stacks and ran a random-read (randrd) test using fio
>>>>>>> from within the hosted VMs in parallel.
>>>>>>>
>>>>>>> Before I get to the results, a little bit about the configuration ...
>>>>>>>
>>>>>>> 3-node cluster; 1x3 plain replicate volume with group virt settings,
>>>>>>> direct-io.
>>>>>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>>>>>> served from the replica that is local to the client).
>>>>>>>
>>>>>>> io-stats was loaded at the following places:
>>>>>>> On the client stack: Above client-io-threads and above
>>>>>>> protocol/client-0 (the first child of AFR).
>>>>>>> On the brick stack: Below protocol/server, above and below
>>>>>>> io-threads and just above storage/posix.
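>>>>>>>
>>>>>>> (For reference, each of these is just a debug/io-stats translator
>>>>>>> hand-inserted into the volfile. The names below are from my setup and
>>>>>>> purely illustrative; the instance above protocol/client-0, for example,
>>>>>>> looks roughly like this, with AFR's subvolumes line then pointing at it
>>>>>>> instead of at the client translator directly:)
>>>>>>>
>>>>>>> volume testvol-io-stats-client-0
>>>>>>>     type debug/io-stats
>>>>>>>     option latency-measurement on
>>>>>>>     option count-fop-hits on
>>>>>>>     subvolumes testvol-client-0
>>>>>>> end-volume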
>>>>>>>
>>>>>>> Based on a 60-second run of the randrd test and subsequent analysis of
>>>>>>> the stats dumped by the individual io-stats instances, here is what I
>>>>>>> found:
>>>>>>>
>>>>>>> Translator position                        Avg latency of READ fop as seen by this translator
>>>>>>>
>>>>>>> 1. parent of client-io-threads                1666us
>>>>>>>
>>>>>>> ∆ (1,2) = 50us
>>>>>>>
>>>>>>> 2. parent of protocol/client-0                1616us
>>>>>>>
>>>>>>> ∆ (2,3) = 1453us
>>>>>>>
>>>>>>> ----------------- end of client stack ---------------------
>>>>>>> ----------------- beginning of brick stack -----------
>>>>>>>
>>>>>>> 3. child of protocol/server                   163us
>>>>>>>
>>>>>>> ∆ (3,4) = 7us
>>>>>>>
>>>>>>> 4. parent of io-threads                        156us
>>>>>>>
>>>>>>> ∆ (4,5) = 20us
>>>>>>>
>>>>>>> 5. child of io-threads                          136us
>>>>>>>
>>>>>>> ∆ (5,6) = 11us
>>>>>>>
>>>>>>> 6. parent of storage/posix                   125us
>>>>>>> ...
>>>>>>> ---------------- end of brick stack ------------------------
>>>>>>>
>>>>>>> So it seems like the biggest bottleneck here is a combination of the
>>>>>>> network and the epoll/rpc layer?
>>>>>>> I must admit I am no expert on networking, but I'm assuming that if the
>>>>>>> client is reading from the local brick, then even the latency contribution
>>>>>>> from the actual network won't be much, in which case the bulk of the
>>>>>>> latency is coming from epoll, the rpc layer, etc. at both the client and
>>>>>>> brick ends. Please correct me if I'm wrong.
>>>>>>>
>>>>>>> I will, of course, do some more runs and confirm if the pattern is
>>>>>>> consistent.
>>>>>>>
>>>>>>> -Krutika
>>>>>>>
>>>>>>>
>>>>>> Really interesting numbers! How many concurrent requests are in
>>>>>> flight in this test? Could you post the fio job? I'm wondering if/how these
>>>>>> latency numbers change if you reduce the number of concurrent requests.
>>>>>>
>>>>>> -- Manoj
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>