[Gluster-users] Lots of connections on clients - appropriate values for various thread parameters

Raghavendra Gowdappa rgowdapp at redhat.com
Fri Mar 29 05:29:07 UTC 2019


+Gluster-users <gluster-users at gluster.org>

Sorry about the delay. There is nothing suspicious about the per-thread CPU
utilization of the glusterfs process. However, looking at the volume profile
attached, I see a huge number of lookups. I think if we cut down the number
of lookups we'll probably see improvements in performance. I need the
following information (example commands follow the list):

* dump of fuse traffic under heavy load (use the --dump-fuse option while
mounting)
* client volume profile for the duration of heavy load -
https://docs.gluster.org/en/latest/Administrator%20Guide/Performance%20Testing/
* corresponding brick volume profile
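
For reference, something along these lines could be used to collect that
data (a sketch only - the volume name "myvol", mount point and output
paths are placeholders, adjust to your setup):

# fuse traffic dump: mount through the glusterfs binary with --dump-fuse
glusterfs --volfile-server=<server> --volfile-id=myvol \
    --dump-fuse=/tmp/fuse-dump.bin /mnt/myvol

# client-side profile: trigger an io-stats dump on the mount point, as
# described in the Performance Testing doc linked above (the dump file
# location/name depends on the gluster version)
setfattr -n trusted.io-stats-dump -v client-profile /mnt/myvol

# brick-side profile
gluster volume profile myvol start
# ... reproduce the heavy load, then:
gluster volume profile myvol info > /tmp/brick-profile.txt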

Basically I need to find out
* whether these lookups are on existing files or non-existent files
* whether they are on directories or files
* why/whether md-cache, the kernel attribute cache, or nl-cache will help
to cut down lookups (see the example tuning commands after this list).
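
If the lookups do turn out to be cacheable, options along these lines could
be tested (a sketch only; "myvol" is a placeholder, the values are examples
rather than recommendations, and exact option names can differ slightly
between versions):

# client-side metadata caching plus upcall-based invalidation
gluster volume set myvol features.cache-invalidation on
gluster volume set myvol features.cache-invalidation-timeout 600
gluster volume set myvol performance.cache-invalidation on
gluster volume set myvol performance.md-cache-timeout 600

# negative lookup cache, mainly helps lookups on non-existent files
gluster volume set myvol performance.nl-cache on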

regards,
Raghavendra

On Mon, Mar 25, 2019 at 12:13 PM Hu Bert <revirii at googlemail.com> wrote:

> Hi Raghavendra,
>
> sorry, this took a while. Over the last weeks the weather was bad -> less
> traffic, but this weekend there was a massive peak. I made 3 profiles
> with top, but at first glance there's nothing special there.
>
> I also made a gluster profile (on one of the servers) at a later
> moment. Maybe that helps. I also added some munin graphs from 2 of
> the clients and 1 graph of the server network, just to show how massive
> the problem is.
>
> Just wondering if the high iowait is related to the high network
> traffic bug (https://bugzilla.redhat.com/show_bug.cgi?id=1673058); if
> so, I could deactivate performance.quick-read (see the sketch below) and
> check if there is less iowait. If that helps: wonderful - and eagerly
> awaiting updated packages (e.g. v5.6). If not: maybe we have to switch
> from our normal 10TB HDDs (raid10) to SSDs, if the problem is down to
> slow hardware for our small-file (images) use case.
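>
> (For the test, presumably just toggling the option, with <volname> as a
> placeholder:)
>
> gluster volume set <volname> performance.quick-read off
> # and later, to revert:
> gluster volume set <volname> performance.quick-read on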
>
>
> Thx,
> Hubert
>
> On Mon, Mar 4, 2019 at 4:59 PM Raghavendra Gowdappa
> <rgowdapp at redhat.com> wrote:
> >
> > Were you seeing high iowait when you captured the top output? I guess
> > not, as you mentioned the load increases during the weekend. Please note
> > that this data has to be captured while you are experiencing problems.
> >
> > On Mon, Mar 4, 2019 at 8:02 PM Hu Bert <revirii at googlemail.com> wrote:
> >>
> >> Hi,
> >> sending the link directly to you and not the list; you can distribute
> >> it if necessary. The command ran for about half a minute. Is that
> >> enough? More? Less?
> >>
> >> https://download.outdooractive.com/top.output.tar.gz
> >>
> >> On Mon, Mar 4, 2019 at 3:21 PM Raghavendra Gowdappa
> >> <rgowdapp at redhat.com> wrote:
> >> >
> >> >
> >> >
> >> > On Mon, Mar 4, 2019 at 7:47 PM Raghavendra Gowdappa
> >> > <rgowdapp at redhat.com> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Mar 4, 2019 at 4:26 PM Hu Bert <revirii at googlemail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Raghavendra,
> >> >>>
> >> >>> at the moment iowait and CPU consumption are quite low; the main
> >> >>> problems appear during the weekend (high traffic, especially on
> >> >>> Sunday), so either we have to wait until next Sunday or use a time
> >> >>> machine ;-)
> >> >>>
> >> >>> I made a screenshot of top (https://abload.de/img/top-hvvjt2.jpg)
> >> >>> and a text output (https://pastebin.com/TkTWnqxt), maybe that
> >> >>> helps. It seems like threads such as glfs_fuseproc (>204h) and
> >> >>> glfs_epoll (64h for each process) consume a lot of CPU (uptime 24
> >> >>> days). Is that already helpful?
> >> >>
> >> >>
> >> >> Not much. The TIME field just says the amount of time the thread has
> >> >> been executing. Since it's a long-standing mount, we can expect such
> >> >> large values. But the value itself doesn't indicate whether the
> >> >> thread was overloaded at any (some) interval(s).
> >> >>
> >> >> Can you please collect the output of the following command and send
> >> >> back the collected data?
> >> >>
> >> >> # top -bHd 3 > top.output
> >> >
> >> >
> >> > Please collect this on problematic mounts and bricks.
> >> >
> >> >>
> >> >>>
> >> >>>
> >> >>> Hubert
> >> >>>
> >> >>> On Mon, Mar 4, 2019 at 11:31 AM Raghavendra Gowdappa
> >> >>> <rgowdapp at redhat.com> wrote:
> >> >>> >
> >> >>> > What is the per-thread CPU usage like on these clients? With
> >> >>> > highly concurrent workloads we've seen the single thread that
> >> >>> > reads requests from /dev/fuse (the fuse reader thread) becoming a
> >> >>> > bottleneck. I'd like to know what the CPU usage of this thread
> >> >>> > looks like (you can use top -H; a quick sketch follows below).
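> >> >>> >
> >> >>> > (A minimal sketch, assuming the client process is called
> >> >>> > glusterfs; the pgrep pattern is just an example:)
> >> >>> >
> >> >>> > top -H -p "$(pgrep -d, -x glusterfs)"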
> >> >>> >
> >> >>> >> On Mon, Mar 4, 2019 at 3:39 PM Hu Bert <revirii at googlemail.com>
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >> Good morning,
> >> >>> >>
> >> >>> >> we use gluster v5.3 (replicate with 3 servers, 2 volumes, raid10
> >> >>> >> as brick) with 10 clients at the moment; 3 of them do heavy I/O
> >> >>> >> operations (apache tomcats, read+write of (small) images). These
> >> >>> >> 3 clients have quite a high I/O wait (stats from yesterday) as
> >> >>> >> can be seen here:
> >> >>> >>
> >> >>> >> client: https://abload.de/img/client1-cpu-dayulkza.png
> >> >>> >> server: https://abload.de/img/server1-cpu-dayayjdq.png
> >> >>> >>
> >> >>> >> The iowait in the graphs differs a lot. I checked netstat on the
> >> >>> >> different clients; the other clients have 8 open connections:
> >> >>> >> https://pastebin.com/bSN5fXwc
> >> >>> >>
> >> >>> >> 4 for each server and each volume. The 3 clients with the heavy
> >> >>> >> I/O have (at the moment), according to netstat, 170, 139 and 153
> >> >>> >> connections. An example for one client can be found here:
> >> >>> >> https://pastebin.com/2zfWXASZ
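> >> >>> >>
> >> >>> >> (Counted roughly like this on a client - just an example:)
> >> >>> >>
> >> >>> >> netstat -tnp | grep glusterfs | wc -l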
> >> >>> >>
> >> >>> >> gluster volume info: https://pastebin.com/13LXPhmd
> >> >>> >> gluster volume status: https://pastebin.com/cYFnWjUJ
> >> >>> >>
> >> >>> >> I was just wondering if the iowait is caused by the clients and
> >> >>> >> their workflow: requesting a lot of files (up to hundreds per
> >> >>> >> second) and opening a lot of connections, so that the servers
> >> >>> >> aren't able to answer properly. Maybe something can be tuned here?
> >> >>> >>
> >> >>> >> Especially the server|client.event-threads (both set to 4),
> >> >>> >> performance.(high|normal|low|least)-prio-threads (all at the
> >> >>> >> default value of 16) and performance.io-thread-count (32)
> >> >>> >> options; maybe these aren't properly configured for up to 170
> >> >>> >> client connections (see the sketch below).
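> >> >>> >>
> >> >>> >> (For reference, those tunables would be changed roughly like
> >> >>> >> this; <volname> and the values are placeholders, not
> >> >>> >> recommendations:)
> >> >>> >>
> >> >>> >> gluster volume set <volname> client.event-threads 8
> >> >>> >> gluster volume set <volname> server.event-threads 8
> >> >>> >> gluster volume set <volname> performance.io-thread-count 32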
> >> >>> >>
> >> >>> >> Both servers and clients have a Xeon CPU (6 cores, 12 threads),
> >> >>> >> a 10 GBit connection and 128G (servers) / 256G (clients) of RAM.
> >> >>> >> Enough power :-)
> >> >>> >>
> >> >>> >>
> >> >>> >> Thx for reading && best regards,
> >> >>> >>
> >> >>> >> Hubert
> >> >>> >> _______________________________________________
> >> >>> >> Gluster-users mailing list
> >> >>> >> Gluster-users at gluster.org
> >> >>> >> https://lists.gluster.org/mailman/listinfo/gluster-users
>