[Gluster-devel] Performance experiments with io-stats translator

Krutika Dhananjay kdhananj at redhat.com
Thu Jun 8 06:44:51 UTC 2017


@Xavi/Raghavendra,

Indeed. I also suspect mutex contention at the epoll layer, and I've been
reading the corresponding code (for the first time) ever since I got these
numbers.
I will get back to you if I have any specific questions around this.

-Krutika

On Thu, Jun 8, 2017 at 9:58 AM, Raghavendra G <raghavendra at gluster.com>
wrote:

>
>
> On Wed, Jun 7, 2017 at 11:59 AM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>
>> Hi Krutika,
>>
>> On 06/06/17 13:35, Krutika Dhananjay wrote:
>>
>>> Hi,
>>>
>>> As part of identifying performance bottlenecks within the gluster stack
>>> for the VM image store use-case, I loaded io-stats at multiple points on
>>> the client and brick stacks and ran a random-read (randrd) test using fio
>>> from within the hosted VMs in parallel.
>>>
>>> Before I get to the results, a little bit about the configuration ...
>>>
>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>> direct-io.
>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>> served from the replica that is local to the client).
>>>
>>> io-stats was loaded at the following places:
>>> On the client stack: Above client-io-threads and above protocol/client-0
>>> (the first child of AFR).
>>> On the brick stack: Below protocol/server, above and below io-threads
>>> and just above storage/posix.
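>>>
>>> For reference, an extra io-stats instance can be spliced into the
>>> generated volfile by hand. A minimal sketch of one such insertion on the
>>> client side - the volume names below are illustrative, not the exact ones
>>> from my setup - looks roughly like this:
>>>
>>> volume testvol-io-stats-client-0
>>>     type debug/io-stats
>>>     option latency-measurement on
>>>     option count-fop-hits on
>>>     subvolumes testvol-client-0
>>> end-volume
>>>
>>> volume testvol-replicate-0
>>>     type cluster/replicate
>>>     # io-stats now sits directly above protocol/client-0, below AFR
>>>     subvolumes testvol-io-stats-client-0 testvol-client-1 testvol-client-2
>>> end-volume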
>>>
>>> Based on a 60-second run of the randrd test and subsequent analysis of
>>> the stats dumped by the individual io-stats instances, here is what I
>>> found:
>>>
>>> Translator Position                       Avg Latency of READ fop as seen by this translator
>>>
>>> 1. parent of client-io-threads                1666us
>>>
>>> ∆ (1,2) = 50us
>>>
>>> 2. parent of protocol/client-0                1616us
>>>
>>> ∆(2,3) = 1453us
>>>
>>> ----------------- end of client stack ---------------------
>>> ----------------- beginning of brick stack -----------
>>>
>>> 3. child of protocol/server                   163us
>>>
>>> ∆(3,4) = 7us
>>>
>>> 4. parent of io-threads                        156us
>>>
>>> ∆(4,5) = 20us
>>>
>>> 5. child of io-threads                          136us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix                   125us
>>> ...
>>> ---------------- end of brick stack ------------------------
>>>
>>> So it seems like the biggest bottleneck here is a combination of the
>>> network and the epoll/rpc layers?
>>> I must admit I am no expert on networks, but I'm assuming that if the
>>> client is reading from the local brick, even the latency contribution
>>> from the actual network won't be much, in which case the bulk of the
>>> latency is coming from the epoll and rpc layers, etc., at both the client
>>> and brick ends? Please correct me if I'm wrong.
>>>
>>> I will, of course, do some more runs and confirm if the pattern is
>>> consistent.
>>>
>>
>> Very interesting. These results are similar to what I observed when doing
>> some EC tests.
>>
>
> For EC we've found that [1] increases performance, though I'm not sure
> whether it'll have any significant impact on replicated setups.
>
>
>> My personal feeling is that there's high serialization and/or contention
>> in the network layer caused by mutexes, but I don't have data to support
>> that.
>>
>
> As to lock contention or lack of concurrency at the socket/rpc layers, AFAIK
> we have the following suspects in the I/O path (as opposed to the accept/listen paths):
>
> * Only one of reading from the socket, writing to the socket, error
> handling on the socket, and voluntary shutdown of the socket (through
> shutdown) can be in progress at a time. IOW, these operations are not
> concurrent, as each of them acquires a lock contended by the others (see
> the first sketch below). My gut feeling is that at least reading from and
> writing to the socket can be made concurrent, but I have to spend more time
> on this to have a definitive answer.
>
> * Till [1], the handler also incurred the cost of message processing by
> higher layers (not just the cost of reading a msg from the socket). Since
> we have epoll configured with EPOLLONESHOT and add the socket back only
> after the handler completes, there was a lag after one msg is read before
> another msg can be read from the same socket (see the second sketch below).
>
> * EPOLLONESHOT also means that processing one event (say POLLIN) excludes
> other events (like POLLOUT when lots of msgs are waiting to be written to
> the socket) till that event is processed. The vice-versa scenario - reads
> blocked while writes are pending on the socket and a POLLOUT is being
> handled - is also true here. I think this is another area where we can
> improve.
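>
> To make the first point concrete, here is a minimal illustrative sketch of
> that pattern (not the actual socket transport code): a single
> per-connection mutex is taken by the read, write, error-handling and
> shutdown paths, so none of them can overlap:
>
> #include <pthread.h>
>
> struct conn {
>     int             sock;
>     pthread_mutex_t lock;   /* contended by every path below */
> };
>
> static void conn_read_msgs(struct conn *c)
> {
>     pthread_mutex_lock(&c->lock);
>     /* ... read and decode whatever is available on c->sock ... */
>     pthread_mutex_unlock(&c->lock);
> }
>
> static void conn_write_msgs(struct conn *c)
> {
>     pthread_mutex_lock(&c->lock);   /* blocks while a read is in progress */
>     /* ... flush queued messages to c->sock ... */
>     pthread_mutex_unlock(&c->lock);
> }
>
> /* The error-handling and shutdown paths take c->lock the same way, which
>  * is why none of these operations run concurrently on one connection. */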
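>
> And for the second and third points, a sketch of the EPOLLONESHOT pattern
> (again illustrative, not the actual event-epoll code): the fd is disarmed
> after each event and re-armed only once the handler returns, so further
> POLLIN/POLLOUT events on that socket are delayed until the re-arm:
>
> #include <stdint.h>
> #include <sys/epoll.h>
>
> static void handle_event(int fd, uint32_t events)
> {
>     /* ... read one msg, or flush pending writes, for fd ... */
>     (void)fd; (void)events;
> }
>
> static void event_loop(int epfd, int sock)
> {
>     struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT | EPOLLONESHOT,
>                               .data.fd = sock };
>     epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
>
>     for (;;) {
>         struct epoll_event out;
>         if (epoll_wait(epfd, &out, 1, -1) <= 0)
>             continue;
>
>         /* While this runs the fd stays disarmed: a second msg sitting in
>          * the receive buffer, or a pending POLLOUT, is not delivered
>          * until the EPOLL_CTL_MOD below. */
>         handle_event(out.data.fd, out.events);
>
>         ev.events = EPOLLIN | EPOLLOUT | EPOLLONESHOT;
>         epoll_ctl(epfd, EPOLL_CTL_MOD, out.data.fd, &ev);
>     }
> }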
>
> I will update the thread as and when I think of other valid suspects.
>
> [1] https://review.gluster.org/17391
>
>
>>
>> Xavi
>>
>>
>>
>>> -Krutika
>>>
>>>
>>
>
>
>
> --
> Raghavendra G
>