[Gluster-users] OOM Kills glustershd process in 3.10.1
Edvin Ekström
edvin.ekstrom at screen9.com
Thu Apr 27 08:21:13 UTC 2017
I've encountered the same issue; in my case it seems to have been caused
by a kernel bug present between 4.4.0-58 and 4.4.0-63
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1655842). Seeing
that you are running 4.4.0-62, I would suggest upgrading and checking
whether the error persists.
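
For reference, this is roughly what the check and upgrade look like on a
stock Ubuntu 16.04 install (assuming you track the regular xenial kernel
through the linux-image-generic meta package; adjust if you use the HWE
stack):

    # show the currently running kernel
    uname -r
    # pull in the latest 4.4.0-xx kernel and matching headers
    sudo apt-get update
    sudo apt-get install linux-image-generic linux-headers-generic
    # the fixed kernel only takes effect after a reboot
    sudo reboot
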
Edvin Ekström,
On 2017-04-26 09:09, Amudhan P wrote:
> I did a volume start force and now the self-heal daemon is up on the
> node that was down.
>
> But bitrot has now triggered the crawling process on all nodes. Why is
> it crawling the disks again if the process was already running?
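>
> As far as I know, the per-node scrubber status can be checked with the
> following command (volume name substituted):
>
>     gluster volume bitrot <volname> scrub status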
>
> [output from bitd.log]
> [2017-04-13 06:01:23.930089] I
> [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in
> volfile, continuing
> [2017-04-26 06:51:46.998935] I [MSGID: 100030]
> [glusterfsd.c:2460:main] 0-/usr/local/sbin/glusterfs: Started running
> /usr/local/sbin/glusterfs version 3.10.1 (args:
> /usr/local/sbin/glusterfs -s localhost --volfile-id gluster/bitd -p
> /var/lib/glusterd/bitd/run/bitd.pid -l /var/log/glusterfs/bitd.log -S
> /var/run/gluster/02f1dd346d47b9006f9bf64e347338fd.socket
> --global-timer-wheel)
> [2017-04-26 06:51:47.002732] I [MSGID: 101190]
> [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started
> thread with index 1
>
>
> On Tue, Apr 25, 2017 at 11:01 PM, Amudhan P <amudhan83 at gmail.com
> <mailto:amudhan83 at gmail.com>> wrote:
>
> Yes, I have enabled the bitrot process, and the signer process is
> currently running on some nodes.
>
> Disabling and re-enabling bitrot doesn't make a difference; it will just
> start the crawl process again, right?
>
>
> On Tuesday, April 25, 2017, Atin Mukherjee <amukherj at redhat.com
> <mailto:amukherj at redhat.com>> wrote:
> >
> >
> > On Tue, Apr 25, 2017 at 9:22 PM, Amudhan P <amudhan83 at gmail.com
> <mailto:amudhan83 at gmail.com>> wrote:
> >>
> >> Hi Pranith,
> >> if I restart the glusterd service on that node alone, will it work?
> >> Because I feel that doing a volume force start will trigger the bitrot
> >> process to crawl the disks on all nodes.
> >
> > Have you enabled bitrot? If not, the process will not be running. As a
> > workaround you can always disable this option before executing volume
> > start force. Please note that volume start force doesn't affect any
> > already running processes.
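> >
> > For example, something like this (volume name substituted; just to show
> > the order, not tested here):
> >
> >     gluster volume bitrot <volname> disable
> >     gluster volume start <volname> force
> >     # re-enabling bitrot afterwards will presumably kick off signing/crawl again
> >     gluster volume bitrot <volname> enable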
> >
> >>
> >> yes, rebalance fix-layout is in progress.
> >> regards
> >> Amudhan
> >>
> >> On Tue, Apr 25, 2017 at 9:15 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com <mailto:pkarampu at redhat.com>> wrote:
> >>>
> >>> You can restart the process using:
> >>> gluster volume start <volname> force
> >>>
> >>> Did shd on this node heal a lot of data? Based on the kind of memory
> >>> usage it showed, it seems like there is a leak.
> >>>
> >>>
> >>> Sunil,
> >>> Could you find out whether there are any leaks in this particular
> >>> version that we might have missed in our testing?
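> >>>
> >>> If it happens again, a statedump from that glustershd process would
> >>> also help narrow it down. Roughly, assuming the default statedump
> >>> directory:
> >>>
> >>>     # ask the self-heal daemon for a statedump via SIGUSR1
> >>>     kill -USR1 $(pgrep -f glustershd)
> >>>     # dumps should appear under /var/run/gluster/ as glusterdump.<pid>.dump.*
> >>>     ls /var/run/gluster/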
> >>>
> >>> On Tue, Apr 25, 2017 at 8:37 PM, Amudhan P
> <amudhan83 at gmail.com <mailto:amudhan83 at gmail.com>> wrote:
> >>>>
> >>>> Hi,
> >>>> On one of my nodes the glustershd process was killed due to OOM; this
> >>>> happened on only one node out of a 40-node cluster.
> >>>> The node is running Ubuntu 16.04.2.
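> >>>>
> >>>> For reference, a quick way to confirm whether shd is down on a node
> >>>> and what the remaining gluster daemons are using (volume name
> >>>> substituted):
> >>>>
> >>>>     # the Self-heal Daemon rows show whether shd is online and its pid
> >>>>     gluster volume status <volname> | grep -i self-heal
> >>>>     # resident/virtual memory of the glusterfs daemons (shd, bitd, etc.)
> >>>>     ps -o pid,rss,vsz,cmd -C glusterfs
> >>>>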
> >>>> dmesg output:
> >>>> [Mon Apr 24 17:21:38 2017] nrpe invoked oom-killer:
> gfp_mask=0x26000c0, order=2, oom_score_adj=0
> >>>> [Mon Apr 24 17:21:38 2017] nrpe cpuset=/ mems_allowed=0
> >>>> [Mon Apr 24 17:21:38 2017] CPU: 0 PID: 12626 Comm: nrpe Not
> tainted 4.4.0-62-generic #83-Ubuntu
> >>>> [Mon Apr 24 17:21:38 2017] 0000000000000286 00000000fc26b170
> ffff88048bf27af0 ffffffff813f7c63
> >>>> [Mon Apr 24 17:21:38 2017] ffff88048bf27cc8 ffff88082a663c00
> ffff88048bf27b60 ffffffff8120ad4e
> >>>> [Mon Apr 24 17:21:38 2017] ffff88087781a870 ffff88087781a860
> ffffea0011285a80 0000000100000001
> >>>> [Mon Apr 24 17:21:38 2017] Call Trace:
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff813f7c63>]
> dump_stack+0x63/0x90
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8120ad4e>]
> dump_header+0x5a/0x1c5
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff811926c2>]
> oom_kill_process+0x202/0x3c0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81192ae9>]
> out_of_memory+0x219/0x460
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198a5d>]
> __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198e56>]
> __alloc_pages_nodemask+0x286/0x2a0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198f0b>]
> alloc_kmem_pages_node+0x4b/0xc0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8107ea5e>]
> copy_process+0x1be/0x1b70
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8122d013>] ?
> __fd_install+0x33/0xe0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81713d01>] ?
> release_sock+0x111/0x160
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff810805a0>]
> _do_fork+0x80/0x360
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8122429c>] ?
> SyS_select+0xcc/0x110
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81080929>]
> SyS_clone+0x19/0x20
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff818385f2>]
> entry_SYSCALL_64_fastpath+0x16/0x71
> >>>> [Mon Apr 24 17:21:38 2017] Mem-Info:
> >>>> [Mon Apr 24 17:21:38 2017] active_anon:553952
> inactive_anon:206987 isolated_anon:0
> >>>> active_file:3410764
> inactive_file:3460179 isolated_file:0
> >>>> unevictable:4914 dirty:212868
> writeback:0 unstable:0
> >>>> slab_reclaimable:386621
> slab_unreclaimable:31829
> >>>> mapped:6112 shmem:211
> pagetables:6178 bounce:0
> >>>> free:82623 free_pcp:213 free_cma:0
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA free:15880kB min:32kB low:40kB
> >>>>  high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB
> >>>>  inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> >>>>  present:15964kB managed:15880kB mlocked:0kB dirty:0kB writeback:0kB
> >>>>  mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> >>>>  kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB
> >>>>  local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> >>>>  all_unreclaimable? yes
> >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 1868 31944
> 31944 31944
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32 free:133096kB min:3948kB
> >>>>  low:4932kB high:5920kB active_anon:170764kB inactive_anon:206296kB
> >>>>  active_file:394236kB inactive_file:525288kB unevictable:980kB
> >>>>  isolated(anon):0kB isolated(file):0kB present:2033596kB
> >>>>  managed:1952976kB mlocked:980kB dirty:1552kB writeback:0kB
> >>>>  mapped:3904kB shmem:724kB slab_reclaimable:502176kB
> >>>>  slab_unreclaimable:8916kB kernel_stack:1952kB pagetables:1408kB
> >>>>  unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> >>>>  writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 30076 30076
> 30076
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal free:181516kB min:63600kB
> >>>>  low:79500kB high:95400kB active_anon:2045044kB inactive_anon:621652kB
> >>>>  active_file:13248820kB inactive_file:13315428kB unevictable:18676kB
> >>>>  isolated(anon):0kB isolated(file):0kB present:31322112kB
> >>>>  managed:30798036kB mlocked:18676kB dirty:849920kB writeback:0kB
> >>>>  mapped:20544kB shmem:120kB slab_reclaimable:1044308kB
> >>>>  slab_unreclaimable:118400kB kernel_stack:33792kB pagetables:23304kB
> >>>>  unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB free_cma:0kB
> >>>>  writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 0 0 0
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB
> 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> >>>> 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32: 18416*4kB (UME) 7480*8kB (UME)
> >>>>  0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> >>>>  0*4096kB = 133504kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal: 44972*4kB (UMEH) 13*8kB (EH)
> >>>>  13*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 0*512kB
> >>>>  0*1024kB 0*2048kB 0*4096kB = 181384kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0
> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0
> hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> >>>> [Mon Apr 24 17:21:38 2017] 6878703 total pagecache pages
> >>>> [Mon Apr 24 17:21:38 2017] 2484 pages in swap cache
> >>>> [Mon Apr 24 17:21:38 2017] Swap cache stats: add 3533870,
> delete 3531386, find 3743168/4627884
> >>>> [Mon Apr 24 17:21:38 2017] Free swap = 14976740kB
> >>>> [Mon Apr 24 17:21:38 2017] Total swap = 15623164kB
> >>>> [Mon Apr 24 17:21:38 2017] 8342918 pages RAM
> >>>> [Mon Apr 24 17:21:38 2017] 0 pages HighMem/MovableOnly
> >>>> [Mon Apr 24 17:21:38 2017] 151195 pages reserved
> >>>> [Mon Apr 24 17:21:38 2017] 0 pages cma reserved
> >>>> [Mon Apr 24 17:21:38 2017] 0 pages hwpoisoned
> >>>> [Mon Apr 24 17:21:38 2017] [ pid ]   uid  tgid  total_vm     rss nr_ptes nr_pmds swapents oom_score_adj name
> >>>> [Mon Apr 24 17:21:38 2017] [  566]     0   566     15064     460      33       3     1108             0 systemd-journal
> >>>> [Mon Apr 24 17:21:38 2017] [  602]     0   602     23693     182      16       3        0             0 lvmetad
> >>>> [Mon Apr 24 17:21:38 2017] [  613]     0   613     11241     589      21       3      264         -1000 systemd-udevd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1381]   100  1381     25081     440      19       3       25             0 systemd-timesyn
> >>>> [Mon Apr 24 17:21:38 2017] [ 1447]     0  1447      1100     307       7       3        0             0 acpid
> >>>> [Mon Apr 24 17:21:38 2017] [ 1449]     0  1449      7252     374      21       3       47             0 cron
> >>>> [Mon Apr 24 17:21:38 2017] [ 1451]     0  1451     77253     994      19       3       10             0 lxcfs
> >>>> [Mon Apr 24 17:21:38 2017] [ 1483]     0  1483      6511     413      18       3       42             0 atd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1505]     0  1505      7157     286      18       3       36             0 systemd-logind
> >>>> [Mon Apr 24 17:21:38 2017] [ 1508]   104  1508     64099     376      27       4      712             0 rsyslogd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1510]   107  1510     10723     497      25       3       45          -900 dbus-daemon
> >>>> [Mon Apr 24 17:21:38 2017] [ 1521]     0  1521     68970     178      38       3      170             0 accounts-daemon
> >>>> [Mon Apr 24 17:21:38 2017] [ 1526]     0  1526      6548     785      16       3       63             0 smartd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1528]     0  1528     54412     146      31       5     1806             0 snapd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1578]     0  1578      3416     335      11       3       24             0 mdadm
> >>>> [Mon Apr 24 17:21:38 2017] [ 1595]     0  1595     16380     470      35       3      157         -1000 sshd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1610]     0  1610     69295     303      40       4       57             0 polkitd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1618]     0  1618      1306      31       8       3        0             0 iscsid
> >>>> [Mon Apr 24 17:21:38 2017] [ 1619]     0  1619      1431     877       8       3        0           -17 iscsid
> >>>> [Mon Apr 24 17:21:38 2017] [ 1624]     0  1624    126363    8027     122       4    22441             0 glusterd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1688]     0  1688      4884     430      15       3       46             0 irqbalance
> >>>> [Mon Apr 24 17:21:38 2017] [ 1699]     0  1699      3985     348      13       3        0             0 agetty
> >>>> [Mon Apr 24 17:21:38 2017] [ 7001]     0  7001    500631   27874     145       5     3356             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [ 8136]     0  8136    500631   28760     141       5     2390             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [ 9280]     0  9280    533529   27752     135       5     3200             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [12626]   111 12626      5991     420      16       3      113             0 nrpe
> >>>> [Mon Apr 24 17:21:38 2017] [14342]     0 14342    533529   28377     135       5     2176             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14361]     0 14361    534063   29190     136       5     1972             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14380]     0 14380    533529   28104     136       6     2437             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14399]     0 14399    533529   27552     131       5     2808             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14418]     0 14418    533529   29588     138       5     2697             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14437]     0 14437    517080   28671     146       5     2170             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14456]     0 14456    533529   28083     139       5     3359             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14475]     0 14475    533529   28054     134       5     2954             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14494]     0 14494    533529   28594     135       5     2311             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14513]     0 14513    533529   28911     138       5     2833             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14532]     0 14532    533529   28259     134       6     3145             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14551]     0 14551    533529   27875     138       5     2267             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14570]     0 14570    484716   28247     142       5     2875             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [27646]     0 27646   3697561  202086    2830      17    16528             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] [27655]     0 27655    787371   29588     197       6    25472             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] [27665]     0 27665    689585     605     108       6     7008             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] [29878]     0 29878    193833   36054     241       4    41182             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] Out of memory: Kill process 27646
> (glusterfs) score 17 or sacrifice child
> >>>> [Mon Apr 24 17:21:38 2017] Killed process 27646 (glusterfs)
> total-vm:14790244kB, anon-rss:795040kB, file-rss:13304kB
> >>>> /var/log/glusterfs/glusterd.log
> >>>> [2017-04-24 11:53:51.359603] I [MSGID: 106006]
> [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify]
> 0-management: glustershd has disconnected from glusterd.
> >>>> What could have gone wrong?
> >>>> regards
> >>>> Amudhan
> >>>>
> >>>> _______________________________________________
> >>>> Gluster-users mailing list
> >>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org>
> >>>> http://lists.gluster.org/mailman/listinfo/gluster-users
> <http://lists.gluster.org/mailman/listinfo/gluster-users>
> >>>
> >>>
> >>>
> >>> --
> >>> Pranith
> >>
> >>
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org>
> >> http://lists.gluster.org/mailman/listinfo/gluster-users
> <http://lists.gluster.org/mailman/listinfo/gluster-users>
> >
> >
>
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users