<div dir="ltr"><div>There are a lot of Lookup operations in the system, but I am not able to find out why. Could you check the output of</div><div><br></div><div># gluster volume heal &lt;volname&gt; info | grep -i number</div><div><br></div><div>It should print all zeros.<br></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Aug 17, 2018 at 1:49 PM Hu Bert <<a href="mailto:revirii@googlemail.com">revirii@googlemail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I don't know what exactly you mean by workload, but the main<br>
function of the volume is storing (incl. writing, reading) images<br>
(from hundreds of bytes up to 30 MB, ~7 TB overall). The work is done<br>
by Apache Tomcat servers writing to / reading from the volume. Besides<br>
images there are some text files and binaries stored on the volume<br>
that get updated regularly (every x hours); we'll try to migrate the<br>
latter to local storage ASAP.<br>
<br>
Interestingly, it's only one process (and its threads), belonging to<br>
the same brick, that consumes the CPU on 2 of the gluster servers.<br>
<br>
gluster11: bricksdd1; not healed; full CPU<br>
gluster12: bricksdd1; got healed; normal CPU<br>
gluster13: bricksdd1; got healed; full CPU<br>
<br>
Besides: performance during the heal (e.g. gluster12, bricksdd1) was way<br>
better than it is now. I've attached 2 PNGs showing the differing CPU<br>
usage over the last week, before/after the heal.<br>
<br>
2018-08-17 9:30 GMT+02:00 Pranith Kumar Karampuri <<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>>:<br>
> There seem to be too many lookup operations compared to any other<br>
> operation. What is the workload on the volume?<br>
><br>
> On Fri, Aug 17, 2018 at 12:47 PM Hu Bert <<a href="mailto:revirii@googlemail.com" target="_blank">revirii@googlemail.com</a>> wrote:<br>
>><br>
>> I hope I got it right.<br>
>><br>
>> gluster volume profile shared start<br>
>> wait 10 minutes<br>
>> gluster volume profile shared info<br>
>> gluster volume profile shared stop<br>
>><br>
>> If that's OK, I've attached the output of the info command.<br>
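(For reference, the capture steps above can be scripted end-to-end, and the resulting `info` output sorted by %-latency to see which fop dominates. This is just a sketch: the awk assumes the usual profile table layout, with %-latency in the first column and the fop name last on each line.)

```shell
# Capture a ~10-minute profile window (volume name 'shared' as in this thread):
#   gluster volume profile shared start
#   sleep 600
#   gluster volume profile shared info > /tmp/profile.shared.txt
#   gluster volume profile shared stop

# List fops sorted by %-latency from the saved info output.
# Assumed line shape:  %-latency  avg  min  max  calls  FOP
sort_fops() {
  awk '$NF ~ /^[A-Z]+$/ && $1 ~ /^[0-9.]+$/ { print $1, $NF }' "$1" | sort -rn
}
# sort_fops /tmp/profile.shared.txt
```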
>><br>
>><br>
>> 2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri <<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>>:<br>
>> > Please also capture a volume profile for around 10 minutes when the CPU% is high.<br>
>> ><br>
>> > On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri<br>
>> > <<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>> wrote:<br>
>> >><br>
>> >> As per the output, all io-threads are using a lot of CPU. It is better<br>
>> >> to check the volume profile to see what is leading to so much work for<br>
>> >> io-threads. Please follow the documentation at<br>
>> >><br>
>> >> <a href="https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/" rel="noreferrer" target="_blank">https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/</a><br>
>> >> section "Running GlusterFS Volume Profile Command",<br>
>> >> and attach the output of "gluster volume profile &lt;volname&gt; info".<br>
>> >><br>
>> >> On Fri, Aug 17, 2018 at 11:24 AM Hu Bert <<a href="mailto:revirii@googlemail.com" target="_blank">revirii@googlemail.com</a>><br>
>> >> wrote:<br>
>> >>><br>
>> >>> Good morning,<br>
>> >>><br>
>> >>> I ran the command during 100% CPU usage and attached the file.<br>
>> >>> Hopefully it helps.<br>
>> >>><br>
>> >>> 2018-08-17 7:33 GMT+02:00 Pranith Kumar Karampuri<br>
>> >>> <<a href="mailto:pkarampu@redhat.com" target="_blank">pkarampu@redhat.com</a>>:<br>
>> >>> > Could you do the following on one of the nodes where you are observing<br>
>> >>> > high CPU usage and attach that file to this thread? We can find which<br>
>> >>> > threads/processes are leading to the high usage. Do this for, say, 10<br>
>> >>> > minutes when you see the ~100% CPU.<br>
>> >>> ><br>
>> >>> > top -bHd 5 > /tmp/top.${HOSTNAME}.txt<br>
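(A quick way to digest the captured file afterwards is to aggregate CPU% per thread name across all samples; a sketch, assuming the default procps `top -H` batch layout where %CPU is field 9 and the command name is last. Thread names like `glfs_iotwr001` are examples, not taken from this system.)

```shell
# Sum the CPU% of each thread name over all 5s samples in the batch-mode
# top output and print the five busiest thread names.
busiest_threads() {
  awk '$1 ~ /^[0-9]+$/ { cpu[$NF] += $9 } END { for (t in cpu) print cpu[t], t }' "$1" |
    sort -rn | head -n 5
}
# busiest_threads /tmp/top.${HOSTNAME}.txt
```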
>> >>> ><br>
>> >>> > On Wed, Aug 15, 2018 at 2:37 PM Hu Bert <<a href="mailto:revirii@googlemail.com" target="_blank">revirii@googlemail.com</a>><br>
>> >>> > wrote:<br>
>> >>> >><br>
>> >>> >> Hello again :-)<br>
>> >>> >><br>
>> >>> >> The self-heal must have finished, as there are no log entries in the<br>
>> >>> >> glustershd.log files anymore. According to Munin, disk latency (average<br>
>> >>> >> I/O wait) has gone down to 100 ms, and disk utilization has gone down<br>
>> >>> >> to ~60%, both on all servers and hard disks.<br>
>> >>> >><br>
>> >>> >> But now the system load on the 2 servers (which were in the good state)<br>
>> >>> >> fluctuates between 60 and 100; the server with the formerly failed<br>
>> >>> >> disk has a load of 20-30. I've uploaded some Munin graphics of the<br>
>> >>> >> CPU usage:<br>
>> >>> >><br>
>> >>> >> <a href="https://abload.de/img/gluster11_cpu31d3a.png" rel="noreferrer" target="_blank">https://abload.de/img/gluster11_cpu31d3a.png</a><br>
>> >>> >> <a href="https://abload.de/img/gluster12_cpu8sem7.png" rel="noreferrer" target="_blank">https://abload.de/img/gluster12_cpu8sem7.png</a><br>
>> >>> >> <a href="https://abload.de/img/gluster13_cpud7eni.png" rel="noreferrer" target="_blank">https://abload.de/img/gluster13_cpud7eni.png</a><br>
>> >>> >><br>
>> >>> >> This can't be normal: 2 of the servers are under heavy load and one not<br>
>> >>> >> that much. Does anyone have an explanation for this strange behaviour?<br>
>> >>> >><br>
>> >>> >><br>
>> >>> >> Thx :-)<br>
>> >>> >><br>
>> >>> >> 2018-08-14 9:37 GMT+02:00 Hu Bert <<a href="mailto:revirii@googlemail.com" target="_blank">revirii@googlemail.com</a>>:<br>
>> >>> >> > Hi there,<br>
>> >>> >> ><br>
>> >>> >> > well, it seems the heal has finally finished. I couldn't see/find any<br>
>> >>> >> > related log message; is there such a message in a specific log file?<br>
>> >>> >> ><br>
>> >>> >> > But I see the same behaviour as when the last heal finished: all CPU<br>
>> >>> >> > cores are consumed by brick processes; not only by the formerly failed<br>
>> >>> >> > bricksdd1, but by all 4 brick processes (and their threads). Load goes<br>
>> >>> >> > up to > 100 on the 2 servers with the not-failed brick, and<br>
>> >>> >> > glustershd.log gets filled with a lot of entries. Load on the server<br>
>> >>> >> > with the then-failed brick is not that high, but still ~60.<br>
>> >>> >> ><br>
>> >>> >> > Is this behaviour normal? Is there some post-heal activity after a<br>
>> >>> >> > heal has finished?<br>
>> >>> >> ><br>
>> >>> >> > thx in advance :-)<br>
>> >>> ><br>
>> >>> ><br>
>> >>> ><br>
>> >>> > --<br>
>> >>> > Pranith<br>
>> >><br>
>> >><br>
>> >><br>
>> >> --<br>
>> >> Pranith<br>
>> ><br>
>> ><br>
>> ><br>
>> > --<br>
>> > Pranith<br>
><br>
><br>
><br>
> --<br>
> Pranith<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Pranith<br></div></div>