<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 7, 2017 at 11:59 AM, Xavier Hernandez <span dir="ltr"><<a href="mailto:xhernandez@datalab.es" target="_blank">xhernandez@datalab.es</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Krutika,<span class="gmail-"><br>
<br>
On 06/06/17 13:35, Krutika Dhananjay wrote:<br>
</span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">
Hi,<br>
<br>
As part of identifying performance bottlenecks within the gluster stack for<br>
the VM image store use case, I loaded io-stats at multiple points in the<br>
client and brick stacks and ran a random-read (randrd) test using fio from<br>
within the hosted VMs in parallel.<br>
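<br>
For reference, the fio job was roughly along the following lines (block size,<br>
iodepth, job count and file name here are illustrative, not necessarily the<br>
exact values used):<br>
<br>
[global]<br>
ioengine=libaio<br>
direct=1<br>
rw=randread<br>
bs=4k<br>
iodepth=16<br>
runtime=60<br>
time_based<br>
<br>
[randrd]<br>
# a preallocated file (or block device) inside each hosted VM<br>
filename=/data/fio-test-file<br>
size=4g<br>
numjobs=4<br>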
<br>
Before I get to the results, a little bit about the configuration ...<br>
<br>
3-node cluster; 1x3 plain replicate volume with the group virt settings<br>
and direct I/O.<br>
3 FUSE clients, one per node in the cluster (which implies reads are<br>
served from the replica that is local to the client).<br>
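<br>
For completeness, the volume and mounts were set up roughly as follows (host<br>
names, brick paths and the mount point are illustrative):<br>
<br>
# on one node<br>
gluster volume create testvol replica 3 \<br>
    node1:/bricks/testvol node2:/bricks/testvol node3:/bricks/testvol<br>
gluster volume set testvol group virt<br>
gluster volume start testvol<br>
<br>
# on each of the three nodes, a local FUSE mount used by the hosted VMs<br>
mount -t glusterfs localhost:/testvol /mnt/vmstore<br>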
<br>
io-stats was loaded at the following places (a rough sketch of one such<br>
instance is included below):<br>
On the client stack: Above client-io-threads and above protocol/client-0<br>
(the first child of AFR).<br>
On the brick stack: Below protocol/server, above and below io-threads,<br>
and just above storage/posix.<br>
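<br>
In practice each of these instances is just a debug/io-stats stanza added to<br>
the volfile at the corresponding point in the graph; one of them (names here<br>
are from my setup and purely illustrative) looks roughly like:<br>
<br>
volume iostats-above-client-0<br>
    type debug/io-stats<br>
    option latency-measurement on<br>
    option count-fop-hits on<br>
    subvolumes testvol-client-0<br>
end-volume<br>
<br>
with the former parent of testvol-client-0 (AFR, in this case) pointing its<br>
subvolumes line at iostats-above-client-0 instead.<br>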
<br>
Based on a 60-second run of the randrd test and subsequent analysis of the<br>
stats dumped by the individual io-stats instances, here is what<br>
I found:<br>
<br></span>
<b><u>Translator Position</u></b>&nbsp;&nbsp;&nbsp;&nbsp;<b><u>Avg Latency of READ fop as seen by this translator</u></b><span class="gmail-"><br>
<br>
1. parent of client-io-threads 1666us<br>
<br>
∆ (1,2) = 50us<br>
<br>
2. parent of protocol/client-0 1616us<br>
<br>
∆(2,3) = 1453us<br>
<br>
----------------- end of client stack ---------------------<br>
----------------- beginning of brick stack -----------<br>
<br>
3. child of protocol/server 163us<br>
<br>
∆(3,4) = 7us<br>
<br>
4. parent of io-threads 156us<br>
<br>
∆(4,5) = 20us<br>
<br>
5. child of io-threads 136us<br>
<br>
∆ (5,6) = 11us<br>
<br>
6. parent of storage/posix 125us<br>
...<br>
---------------- end of brick stack ------------------------<br>
<br>
So it seems like the biggest bottleneck here is a combination of the<br>
network and the epoll/rpc layer?<br>
I must admit I am no expert on networks, but since each client is reading<br>
from its local brick, I'm assuming the actual network contributes very little<br>
latency, in which case the bulk of the ~1450us gap is coming from epoll, the<br>
rpc layer, etc. at both the client and brick ends? Please correct me if I'm wrong.<br>
<br>
I will, of course, do some more runs and confirm if the pattern is<br>
consistent.<br>
</span></blockquote>
<br>
Very interesting. These results are similar to what I also observed when doing some EC tests.<br></blockquote><div><br></div><div>For EC we've found [1] to increase performance, though I'm not sure whether it'll have any significant impact on replicated setups.<br><br></div><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
My personal feeling is that there's high serialization and/or contention in the network layer caused by mutexes, but I don't have data to support that.<br></blockquote><div><br><div>As for lock contention or lack of concurrency at the socket/rpc layers, AFAIK we have the following suspects in the I/O path (as opposed to the accept/listen path):<br><br></div><div>* Only one of reading from the socket, writing to the socket, error handling on the socket, and voluntary shutdown of the socket (through shutdown) can be in progress at a time. IOW, these operations are not concurrent, as each of them acquires a lock contended by the others. My gut feeling is that at least reading from and writing to the socket can be made concurrent, but I have to spend more time on this to have a definitive answer.<br><br></div><div>* Till [1], the handler also incurred the cost of message processing by the higher layers (not just the cost of reading a message from the socket). Since we have epoll configured with EPOLLONESHOT and add the socket back only after the handler completes, there was a lag after one message was read before the next message could be read from the same socket.<br><br></div><div>* EPOLLONESHOT also means that processing one event (say POLLIN) excludes other events (like POLLOUT when lots of messages are waiting to be written to the socket) until that event is processed. The vice-versa scenario - reads blocked while writes are pending on a socket and a POLLOUT is received - is also true. I think this is another area where we can improve (see the sketch below).<br><br></div><div>Will update the thread as and when I think of other valid suspects.<br></div><div><br>[1] <a href="https://review.gluster.org/17391">https://review.gluster.org/17391</a> <br></div>
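<div><br>To make the EPOLLONESHOT point above concrete, here is a rough, standalone sketch of the one-shot re-arm pattern (not gluster's actual code; handle_pollin/handle_pollout are illustrative stand-ins for the socket read/write paths):<br></div><div><pre>
#include &lt;sys/epoll.h&gt;

/* Stand-ins for reading one message from / flushing queued writes to the socket. */
static void handle_pollin(int fd)  { (void)fd; }
static void handle_pollout(int fd) { (void)fd; }

/* Re-register interest. With EPOLLONESHOT the fd stays disarmed until this
 * EPOLL_CTL_MOD runs, so no further POLLIN or POLLOUT is delivered for it.
 * (Real code would request EPOLLOUT only while there is pending data to write.) */
static void rearm(int epfd, int fd)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLOUT | EPOLLONESHOT;
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &amp;ev);
}

static void event_loop(int epfd)
{
    struct epoll_event ev;
    for (;;) {
        if (epoll_wait(epfd, &amp;ev, 1, -1) &lt;= 0)
            continue;
        int fd = ev.data.fd;
        if (ev.events &amp; EPOLLIN)
            handle_pollin(fd);   /* until [1] this also ran the higher-layer
                                    request handler, delaying the re-arm below */
        if (ev.events &amp; EPOLLOUT)
            handle_pollout(fd);
        rearm(epfd, fd);         /* only after this can the next event arrive */
    }
}
</pre></div><div>Nothing gluster-specific in the sketch; it just shows why, on a busy socket, reads and writes end up serialized behind whichever single event is currently being processed.<br></div>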
<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Xavi<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
-Krutika<br>
<br>
<br>
</blockquote>
<br>
_______________________________________________<br>
Gluster-devel mailing list<br>
<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">http://lists.gluster.org/mailman/listinfo/gluster-devel</a></div></div></blockquote>
</div></div>