<div dir="auto"><div><div class="gmail_quote"><div dir="ltr">On Fri, 25 Jan 2019, 08:53 Vijay Bellur <<a href="mailto:vbellur@redhat.com">vbellur@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Thank you for the detailed update, Xavi! This looks very interesting. </div><br><div class="gmail_quote"><div dir="ltr" class="m_-2825772463141807508gmail_attr">On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez <<a href="mailto:xhernandez@redhat.com" target="_blank" rel="noreferrer">xhernandez@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hi all,<div><br></div><div>I've just updated a patch [1] that implements a new thread pool based on a wait-free queue provided by the userspace-rcu library. The patch also includes an auto-scaling mechanism that keeps only as many threads running as the current workload needs.</div><div><br></div><div>This new approach has some advantages:</div></div><div><ul><li>It's provided globally inside libglusterfs instead of inside an xlator<br></li></ul></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div dir="ltr"><div>This makes it possible for the fuse thread and epoll threads to hand off received requests to another thread sooner, wasting less CPU and reacting sooner to other incoming requests.</div></div></blockquote><ul><li>Adding jobs to the queue used by the thread pool only requires an atomic operation</li></ul><div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div>This makes the producer side of the queue really fast, with almost no delay.</div></blockquote></div><ul><li>Contention is reduced</li></ul><div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div>The producer side has negligible contention thanks to the 
wait-free enqueue operation based on an atomic access. The consumer side requires a mutex, but it is held for a very short time, and the scaling mechanism makes sure that no more threads than needed are contending for it.</div></blockquote></div><div><br></div>This change disables io-threads, since it replaces part of its functionality. However, there are two things that could still be needed from io-threads:<div><ul><li>Prioritization of fops</li></ul></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div>Currently, io-threads assigns priorities to each fop, so that some fops are handled before others.</div></div></blockquote><div><ul><li>Fair distribution of execution slots between clients</li></ul></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div>Currently, io-threads processes requests from each client in round-robin order.</div></div></blockquote><div><div dir="ltr"><div><br></div><div>These features are not implemented right now. If they are needed, probably the best thing to do would be to keep them inside io-threads, but change its implementation so that it uses the global threads from the thread pool instead of its own threads.</div></div></div></div></div></blockquote><div><br></div><div><br></div><div>These features are indeed useful to have, and hence modifying the implementation of io-threads to provide this behavior would be welcome.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div></div><div><div dir="ltr"><div><div dir="ltr"><div><br></div><div>These tests have shown that the limiting factor has been the disk in most cases, so it's hard to tell if the change has really improved things. There is only one clear exception: self-heal on a dispersed volume completes 12.7% faster. 
CPU utilization has also dropped drastically:</div><div><br></div></div></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div dir="ltr"><div>Old implementation: <span style="color:rgb(0,0,0);font-family:monospace">12.30 user, 41.78 sys, 43.16 idle, 0.73 wait</span></div></div></div></blockquote><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px">New implementation: <span style="color:rgb(0,0,0);font-family:monospace">4.91 user, 5.52 sys, 81.60 idle, 5.91 wait</span></blockquote><div><br></div><div><div dir="ltr"><div>Now I'm running some more tests on NVMe to see the effects of the change when the disk is not limiting performance. I'll update once I have more data.</div><div><br></div></div></div></div></div></div></blockquote><div><br></div><div>Will look forward to these numbers.</div></div></div></blockquote></div></div><div dir="auto"><br></div><div dir="auto">I have identified an issue that limits the number of active threads when load is high, causing some regressions. I'll fix it and rerun the tests on Monday.</div><div dir="auto"><br></div><div dir="auto">Xavi</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><br></div><div><br></div><div>Regards,</div><div>Vijay </div></div></div>
</blockquote></div></div></div>