<div dir="ltr"><div dir="ltr">On Sun, Jan 27, 2019 at 8:03 AM Xavi Hernandez &lt;<a href="mailto:xhernandez@redhat.com">xhernandez@redhat.com</a>&gt; wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><div class="gmail_quote"><div dir="ltr">On Fri, 25 Jan 2019, 08:53 Vijay Bellur &lt;<a href="mailto:vbellur@redhat.com" target="_blank">vbellur@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Thank you for the detailed update, Xavi! This looks very interesting. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail-m_-4156925023005747689m_-2825772463141807508gmail_attr">On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez &lt;<a href="mailto:xhernandez@redhat.com" rel="noreferrer" target="_blank">xhernandez@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hi all,<div><br></div><div>I&#39;ve just updated a patch [1] that implements a new thread pool based on a wait-free queue provided by the userspace-rcu library. 
The patch also includes an auto-scaling mechanism that keeps only as many threads running as the current workload needs.</div><div><br></div><div>This new approach has some advantages:</div></div><div><ul><li>It&#39;s provided globally inside libglusterfs instead of inside an xlator<br></li></ul></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div dir="ltr"><div>This lets the fuse thread and the epoll threads hand the received request over to another thread sooner, wasting less CPU and reacting sooner to other incoming requests.</div></div></blockquote><ul><li>Adding jobs to the queue used by the thread pool only requires an atomic operation</li></ul><div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div>This makes the producer side of the queue really fast, with almost no delay.</div></blockquote></div><ul><li>Contention is reduced</li></ul><div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div>The producer side has negligible contention thanks to the wait-free enqueue operation based on an atomic access. The consumer side requires a mutex, but it is held only very briefly and the scaling mechanism makes sure that there are no more threads than needed contending for the mutex.</div></blockquote></div><div><br></div>This change disables io-threads, since it replaces part of its functionality. 
However, there are two things that could be needed from io-threads:<div><ul><li>Prioritization of fops</li></ul></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div>Currently, io-threads assigns a priority to each fop, so that some fops are handled before others.</div></div></blockquote><div><ul><li>Fair distribution of execution slots between clients</li></ul></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div>Currently, io-threads processes requests from each client in round-robin order.</div></div></blockquote><div><div dir="ltr"><div><br></div><div>These features are not implemented right now. If they are needed, probably the best thing to do would be to keep them inside io-threads, but change its implementation so that it uses the global threads from the thread pool instead of its own threads.</div></div></div></div></div></blockquote><div><br></div><div><br></div><div>These features are indeed useful to have, and hence modifying the implementation of io-threads to provide this behavior would be welcome.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div></div><div><div dir="ltr"><div><div dir="ltr"><div><br></div><div>These tests have shown that the limiting factor has been the disk in most cases, so it&#39;s hard to tell if the change has really improved things. There is only one clear exception: self-heal on a dispersed volume completes 12.7% faster. 
CPU utilization has also dropped drastically:</div><div><br></div></div></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div dir="ltr"><div>Old implementation: <span style="color:rgb(0,0,0);font-family:monospace">12.30 user, 41.78 sys, 43.16 idle,  0.73 wait</span></div></div></div></blockquote><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px">New implementation: <span style="color:rgb(0,0,0);font-family:monospace">4.91 user,  5.52 sys, 81.60 idle,  5.91 wait</span></blockquote><span style="color:rgb(0,0,0);font-family:monospace"><div><span style="color:rgb(0,0,0);font-family:monospace"><br></span></div></span><div><div dir="ltr"><div>Now I&#39;m running some more tests on NVMe to try to see the effects of the change when disk is not limiting performance. I&#39;ll update once I have more data.</div><div><br></div></div></div></div></div></div></blockquote><div><br></div><div>Will look forward to these numbers.</div></div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">I have identified an issue that limits the number of active threads when load is high, causing some regressions. I&#39;ll fix it and rerun the tests on Monday.</div></div></blockquote><div><br></div><div>Once the issue was solved, some workloads showed high load averages that actually caused a regression (too much I/O, I guess) instead of improving performance. So I added a configurable maximum number of threads and made the whole implementation optional, so that it can be safely used when required.</div><div><br></div><div>I did some tests and was able to get at least the same performance we had before this patch in all cases, and better performance in some. 
But each test needed manual configuration of the number of threads.</div><div><br></div><div>I need to work on a way to automatically compute the maximum so that it can be used easily with any workload (or even combined workloads).</div><div><br></div><div>I have uploaded the latest version of the patch.</div><div><br></div><div>Xavi</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div dir="auto"><br></div><div dir="auto">Xavi</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div><br></div><div>Regards,</div><div>Vijay </div></div></div>

</blockquote></div></div></div>

</blockquote></div></div>
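<div dir="ltr"><div>To make the scaling behaviour discussed in this thread concrete, here is a minimal Python sketch of a pool that starts workers on demand, never exceeds a configurable maximum, and lets idle workers exit. It is not the patch's implementation: the real pool is C code inside libglusterfs built on the userspace-rcu wait-free queue, whereas this sketch substitutes Python's lock-based queue.Queue. The names ScalingPool, max_threads and idle_timeout are hypothetical and chosen only for illustration.</div>
<pre>
```python
import queue
import threading


class ScalingPool:
    """Toy pool: starts workers on demand up to max_threads; idle workers exit."""

    def __init__(self, max_threads, idle_timeout=0.2):
        self.max_threads = max_threads    # hypothetical cap, set by hand
        self.idle_timeout = idle_timeout  # seconds a worker waits before exiting
        self.jobs = queue.Queue()         # lock-based stand-in, NOT wait-free
        self.lock = threading.Lock()
        self.threads = 0                  # workers currently alive
        self.idle = 0                     # workers currently waiting for a job
        self.peak = 0                     # highest number of live workers seen

    def submit(self, fn, *args):
        self.jobs.put((fn, args))
        with self.lock:
            # Scale up only when queued jobs outnumber idle workers
            # and the configured maximum has not been reached.
            backlog = self.jobs.qsize() > self.idle
            if backlog and self.max_threads > self.threads:
                self.threads += 1
                self.peak = max(self.peak, self.threads)
                threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        with self.lock:
            self.idle += 1
        while True:
            try:
                fn, args = self.jobs.get(timeout=self.idle_timeout)
            except queue.Empty:
                with self.lock:
                    if self.jobs.empty():
                        # Nothing to do: scale down by letting this worker exit.
                        self.idle -= 1
                        self.threads -= 1
                        return
                continue  # a job arrived while timing out; go fetch it
            with self.lock:
                self.idle -= 1
            try:
                fn(*args)
            except Exception:
                pass  # a real pool would log the failure
            finally:
                self.jobs.task_done()
                with self.lock:
                    self.idle += 1
```
</pre>
<div>A caller would create the pool with something like ScalingPool(max_threads=4), hand work over with pool.submit(fn, arg), and wait for completion with pool.jobs.join(). As in the tests described above, the cap still has to be picked by hand per workload; computing it automatically is the open problem, and this sketch does not attempt it.</div></div>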