[Gluster-users] 100% CPU WAIT

Wed Oct 8 05:48:28 UTC 2014

On 10/06/2014 05:04 PM, Jocelyn Hotte wrote:
> Hi Tom,
> Which version of Gluster are you running? I talked with my operations team, and they don't recall seeing a log entry for afr_dir_exclusive_crawl. But AFR suggests this is a self-heal.
> I therefore suspect you're using Gluster in much the same way we do, which means a lot of file entries in a single directory.
>
> In our case, we have several million files in our Gluster cluster, and when a self-heal hits, we can kiss our Gluster goodbye for a couple of hours.
> We had to disable the self-heal mechanisms on our clusters to prevent that, which increased the chance of split-brain files.
> We then developed a daemon to heal these split-brains ourselves.
Hi Jocelyn,
We changed the entry self-heal algorithm in 3.6 to improve this
behaviour. We would love your feedback on this use case with the
latest 3.6 beta3 bits.

Pranith.
>
> -----Original Message-----
> From: Tom van Leeuwen [mailto:tom.van.leeuwen at saasplaza.com]
> Sent: 6 October 2014 03:00
> To: Jocelyn Hotte; gluster-users at gluster.org
> Subject: Re: [Gluster-users] 100% CPU WAIT
>
> Hi Jocelyn,
>
> Thanks for your response. I noticed this 100% CPU WAIT on server01 and decided to reboot it.
> After booting I noticed these two messages:
> glustershd.log:[2014-10-03 05:05:46.969650] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-myvol-replicate-0:
> Another crawl is in progress for myvol-client-0
> glustershd.log:[2014-10-03 05:05:46.970111] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-myvol2-replicate-0:
> Another crawl is in progress for myvol2-client-0
>
> It is 2014-10-06 now. myvol has 493G in use and myvol2 has 4G in use. The 100% CPU WAIT is still there.
> I have no idea where it comes from, and I have no idea whether afr-self-heald is still running.
>
> What prompted me to look into this is a complaint that throughput on a large write (time dd if=/dev/zero of=2G.bin bs=1M count=2048) was initially ~75MB/s and is now ~40MB/s.
>
> [root at server01 glusterfs]# iostat -k 10 /dev/xvdb # The myvol disk
> Linux 2.6.32-358.6.2.el6.x86_64 (server01)     06-10-14 _x86_64_    (2 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,49    0,00    0,89    8,45    0,03   90,14
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdb             19,25        67,80        89,64   18013783   23816408
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,35    0,00    1,36   48,74    0,05   49,50
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdb             70,70       282,80         0,00       2828          0
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,35    0,00    0,90   48,95    0,00   49,80
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdb             67,90       271,60         0,00       2716          0
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,25    0,00    1,15   48,69    0,00   49,90
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdb             67,20       268,80         0,00       2688          0
>
> I have no idea what it is doing or how to proceed with this issue.
>
> On 03-10-14 16:58, Jocelyn Hotte wrote:
>> Hi Tom,
>> We see this behavior when a self-heal runs after a communication failure between two nodes, or after a node has crashed.
>>
>> We usually diagnose it by looking at the mount log (tail -f
>> /var/log/gluster/mnt-log), where you should see entries such as
>> afr ... self-heal.
>>
>> -----Original Message-----
>> From: gluster-users-bounces at gluster.org
>> [mailto:gluster-users-bounces at gluster.org] On Behalf Of Tom van
>> Leeuwen
>> Sent: 3 October 2014 06:00
>> To: gluster-users at gluster.org
>> Subject: [Gluster-users] 100% CPU WAIT
>>
>> Hi guys,
>>
>> My glusterfs is causing 100% CPU WAIT according to `top`.
>> This has been going on for hours and I have no idea what is causing it.
>> How can I troubleshoot?
>>
>> Iotop reports this:
>> Total DISK READ: 268.60 K/s | Total DISK WRITE: 0.00 B/s
>>      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
>>     7899 be/4 root      268.60 K/s    0.00 B/s  0.00 % 96.70 % glusterfsd -s server01 --volfile-id myvol.server01.glusterfs-brick1 -p /var/lib/glusterd/vols/myvol/run/server01-glusterfs-brick1.pid -S /var/run/a7562806405853d2b9382d6fc59051cc.socket --brick-name /glusterfs/brick1 -l /var/log/glusterfs/bricks/glusterfs-brick1.log --xlator-option *-posix.glusterd-uuid=07acd5b2-85e6-46f1-8477-038028e8ef7f --brick-port 49152 --xlator-option myvol-server.listen-port=49152
>>     1885 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.98 % glusterfsd -s server01 --volfile-id myvol.server01.glusterfs-brick1 -p /var/lib/glusterd/vols/myvol/run/server01-glusterfs-brick1.pid -S /var/run/a7562806405853d2b9382d6fc59051cc.socket --brick-name /glusterfs/brick1 -l /var/log/glusterfs/bricks/glusterfs-brick1.log --xlator-option *-posix.glusterd-uuid=07acd5b2-85e6-46f1-8477-038028e8ef7f --brick-port 49152 --xlator-option myvol-server.listen-port=49152
>>
>> Kind regards,
>> Tom van Leeuwen
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
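
The iostat samples quoted in this thread can be summarized with a short script. What follows is a minimal sketch, not something posted by anyone in the thread: the function names (unquote, iowait_samples, device_samples) are hypothetical, and it assumes the report format shown above, including the mail quote markers and the comma decimal separator. It skips the first report, which iostat computes as an average since boot, and then derives the mean %iowait and the average size of each read request (kB_read/s divided by tps).

```python
SAMPLE = """\
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,49    0,00    0,89    8,45    0,03   90,14
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
> xvdb             19,25        67,80        89,64   18013783 23816408
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,35    0,00    1,36   48,74    0,05   49,50
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
> xvdb             70,70       282,80         0,00 2828          0
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,35    0,00    0,90   48,95    0,00   49,80
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
> xvdb             67,90       271,60         0,00 2716          0
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>              0,25    0,00    1,15   48,69    0,00   49,90
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read kB_wrtn
> xvdb             67,20       268,80         0,00 2688          0
"""


def to_float(text):
    """Parse a number that may use a comma as the decimal separator."""
    return float(text.replace(",", "."))


def unquote(line):
    """Strip leading mail quote markers ('>') and surrounding whitespace."""
    return line.lstrip("> ").strip()


def iowait_samples(report):
    """Return the %iowait value from every avg-cpu block, in order."""
    samples, columns = [], None
    for raw in report.splitlines():
        line = unquote(raw)
        if line.startswith("avg-cpu:"):
            columns = line.split()[1:]  # column names after the label
        elif columns is not None:
            fields = line.split()
            if len(fields) == len(columns):
                samples.append(to_float(fields[columns.index("%iowait")]))
            columns = None  # values are only expected on the next line
    return samples


def device_samples(report, device):
    """Return (tps, kB_read/s, kB_wrtn/s) for every row of the device."""
    rows = []
    for raw in report.splitlines():
        fields = unquote(raw).split()
        if len(fields) >= 4 and fields[0] == device:
            rows.append(tuple(to_float(f) for f in fields[1:4]))
    return rows


if __name__ == "__main__":
    # Drop the first report: it is the since-boot average, not an interval.
    waits = iowait_samples(SAMPLE)[1:]
    print(f"mean %iowait over intervals: {sum(waits) / len(waits):.2f}")
    for tps, read_kbs, _ in device_samples(SAMPLE, "xvdb")[1:]:
        print(f"average read size: {read_kbs / tps:.1f} kB per request")
```

On the numbers quoted above, the three interval reports average roughly 48.8% iowait with reads of about 4 kB per request. Since iostat averages its CPU percentages across all processors, a value near 49% on this 2-CPU host corresponds to about one CPU's worth of time spent waiting on I/O.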