[Gluster-devel] One client can effectively hang entire gluster array

Raghavendra G raghavendra at gluster.com
Tue Jul 12 05:28:54 UTC 2016


On Fri, Jul 8, 2016 at 8:02 PM, Jeff Darcy <jdarcy at redhat.com> wrote:

> > In either of these situations, one glusterfsd process on whatever
> > peer the client is currently talking to will skyrocket to *nproc*
> > cpu usage (800%, 1600%) and the storage cluster is essentially
> > useless; all other clients will eventually try to read or write data
> > to the overloaded peer and, when that happens, their connection will
> > hang. Heals between peers hang because the load on the peer is
> > around 1.5x the number of cores or more. This occurs in either
> > gluster 3.6 or 3.7, is very repeatable, and happens much too
> > frequently.
>
> I have some good news and some bad news.
>
> The good news is that features to address this are already planned for the
> 4.0 release.  Primarily I'm referring to QoS enhancements, some parts of
> which were already implemented for the bitrot daemon.  I'm still working
> out the exact requirements for this as a general facility, though.  You
> can help!  :)  Also, some of the work on "brick multiplexing" (multiple
> bricks within one glusterfsd process) should help to prevent the thrashing
> that causes a complete freeze-up.
>
> Now for the bad news.  Did I mention that these are 4.0 features?  4.0 is
> not near term, and not getting any nearer as other features and releases
> keep "jumping the queue" to absorb all of the resources we need for 4.0
> to happen.  Not that I'm bitter or anything.  ;)  To address your more
> immediate concerns, I think we need to consider more modest changes that
> can be completed in more modest time.  For example:
>
>  * The load should *never* get to 1.5x the number of cores.  Perhaps we
>    could tweak the thread-scaling code in io-threads and epoll to check
>    system load and not scale up (or even scale down) if system load is
>    already high.
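
A minimal sketch of such a load check, assuming a made-up helper name
(iot_can_scale_up() is illustrative, not existing gluster code; the
real decision would sit in the io-threads/epoll thread-scaling path):

    #include <stdlib.h>       /* getloadavg() */
    #include <unistd.h>       /* sysconf()    */

    /* Refuse to spawn another worker thread if the 1-minute load
     * average already exceeds the core count. */
    static int
    iot_can_scale_up (void)
    {
            double load;
            long   ncores = sysconf (_SC_NPROCESSORS_ONLN);

            if (ncores < 1)
                    ncores = 1;

            if (getloadavg (&load, 1) != 1)
                    return 1;   /* load unknown; keep old behaviour */

            return load < (double) ncores;
    }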
>
>  * We might be able to tweak io-threads (which already runs on the
>    bricks and already has a global queue) to schedule requests in a
>    fairer way across clients.  Right now it executes them in the
>    same order that they were read from the network.


This sounds like the easier fix. We can make io-threads factor in
another input, namely the client a request came through (essentially
frame->root->client), before scheduling it. That should at least make
the problem bearable, even if it doesn't eliminate it. As for the
algorithm, we could consider the leaky-bucket approach from the bitrot
implementation, or dmclock. I haven't thought deeply about the
algorithm part yet; if the overall approach sounds OK, we can discuss
algorithms further.
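
To make that concrete, the dequeue side could look roughly like the
sketch below. The types are illustrative stand-ins, not the actual
io-threads structures; the real queues would be keyed on
frame->root->client:

    #include <stddef.h>

    struct request {
            struct request *next;
            /* decoded fop, call frame, etc. */
    };

    struct client_q {
            struct client_q *next;   /* circular list of clients */
            struct request  *head;
            struct request  *tail;
    };

    /* Take one request from the first non-empty queue at or after the
     * cursor, then advance the cursor, so each client gets one turn
     * and a single busy client can't monopolise the worker threads. */
    static struct request *
    fair_dequeue (struct client_q **cursor)
    {
            struct client_q *start = *cursor;
            struct client_q *q     = start;
            struct request  *req   = NULL;

            if (!q)
                    return NULL;

            do {
                    if (q->head) {
                            req     = q->head;
                            q->head = req->next;
                            if (!q->head)
                                    q->tail = NULL;
                            *cursor = q->next;   /* next client's turn */
                            break;
                    }
                    q = q->next;
            } while (q != start);

            return req;
    }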

>    That tends to be a bit "unfair" and that should be fixed in the
>    network code, but that's a much harder task.
>
> These are only weak approximations of what we really should be doing,
> and will be doing in the long term, but (without making any promises)
> they might be sufficient and achievable in the near term.  Thoughts?
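
On the algorithm side, the core of the leaky-bucket idea (in its
token-bucket form) is small; this sketch is generic, not the actual
bitrot throttle, and the numbers would need tuning per deployment:

    #include <time.h>

    struct leaky_bucket {
            double tokens;           /* currently available credit */
            double rate;             /* credit refilled per second */
            double burst;            /* maximum accumulated credit */
            struct timespec last;
    };

    /* Refill credit for the time elapsed since the last call, then
     * admit the request only if enough credit remains. */
    static int
    lb_admit (struct leaky_bucket *lb, double cost)
    {
            struct timespec now;
            double elapsed;

            clock_gettime (CLOCK_MONOTONIC, &now);
            elapsed = (now.tv_sec  - lb->last.tv_sec)
                    + (now.tv_nsec - lb->last.tv_nsec) / 1e9;
            lb->last = now;

            lb->tokens += elapsed * lb->rate;
            if (lb->tokens > lb->burst)
                    lb->tokens = lb->burst;

            if (lb->tokens < cost)
                    return 0;        /* over budget: defer this client */

            lb->tokens -= cost;
            return 1;
    }

dmclock would go further, giving each client a reservation, weight and
limit, but per-client buckets alone should already stop one client
from drowning out the rest.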



-- 
Raghavendra G