[Gluster-users] OOM Kills glustershd process in 3.10.1
Amudhan P
amudhan83 at gmail.com
Wed Apr 26 07:09:12 UTC 2017
I ran volume start force, and the self-heal daemon is now up on the node
where it was down.
But bitrot has now triggered the crawl process on all nodes. Why is it
crawling the disks again if the process was already running?
[output from bitd.log]
[2017-04-13 06:01:23.930089] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk]
0-glusterfs: No change in volfile, continuing
[2017-04-26 06:51:46.998935] I [MSGID: 100030] [glusterfsd.c:2460:main]
0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs
version 3.10.1 (args: /usr/local/sbin/glusterfs -s localhost --volfile-id
gluster/bitd -p /var/lib/glusterd/bitd/run/bitd.pid -l
/var/log/glusterfs/bitd.log -S
/var/run/gluster/02f1dd346d47b9006f9bf64e347338fd.socket
--global-timer-wheel)
[2017-04-26 06:51:47.002732] I [MSGID: 101190]
[event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
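
One way to check whether bitd was freshly started (rather than already
running) is to look for "Started running" entries in bitd.log. The sketch
below is only an illustration: the helper name is hypothetical, and it
assumes the entry format shown above with wrapped lines already rejoined.

```python
import re

# Matches the leading "[YYYY-MM-DD HH:MM:SS.ffffff]" timestamp of a log entry.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")

def daemon_start_times(lines):
    """Return timestamps of entries reporting a process start (hypothetical helper)."""
    starts = []
    for line in lines:
        m = TS.match(line)
        if m and "Started running" in line:
            starts.append(m.group(1))
    return starts

# The two bitd.log entries from above, rejoined; the args list is abbreviated here.
sample = [
    "[2017-04-13 06:01:23.930089] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] "
    "0-glusterfs: No change in volfile, continuing",
    "[2017-04-26 06:51:46.998935] I [MSGID: 100030] [glusterfsd.c:2460:main] "
    "0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs "
    "version 3.10.1",
]
print(daemon_start_times(sample))
```

A start entry dated after the previous log activity would suggest the daemon
was launched again at that time, which may be what restarted the crawl.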
On Tue, Apr 25, 2017 at 11:01 PM, Amudhan P <amudhan83 at gmail.com> wrote:
> Yes, I have enabled bitrot, and the signer process is currently running
> on some nodes.
>
> Disabling and re-enabling bitrot doesn't make a difference; it will start
> the crawl process again, right?
>
>
> On Tuesday, April 25, 2017, Atin Mukherjee <amukherj at redhat.com> wrote:
> >
> >
> > On Tue, Apr 25, 2017 at 9:22 PM, Amudhan P <amudhan83 at gmail.com> wrote:
> >>
> >> Hi Pranith,
> >> If I restart the glusterd service on that node alone, will it work? I ask
> because I feel that doing volume start force will trigger the bitrot process
> to crawl the disks on all nodes.
> >
> > Have you enabled bitrot? If not then the process will not be in
> existence. As a workaround you can always disable this option before
> executing volume start force. Please note volume start force doesn't affect
> any running processes.
> >
> >>
> >> Yes, rebalance fix-layout is in progress.
> >> regards
> >> Amudhan
> >>
> >> On Tue, Apr 25, 2017 at 9:15 PM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> >>>
> >>> You can restart the process using:
> >>> gluster volume start <volname> force
> >>>
> >>> Did shd on this node heal a lot of data? Based on the kind of memory
> usage it showed, seems like there is a leak.
> >>>
> >>>
> >>> Sunil,
> >>>           Could you find out if there are any leaks in this particular
> version that we might have missed in our testing?
> >>>
> >>> On Tue, Apr 25, 2017 at 8:37 PM, Amudhan P <amudhan83 at gmail.com>
> wrote:
> >>>>
> >>>> Hi,
> >>>> On one of my nodes the glustershd process was killed due to OOM. This
> happened on only one node out of a 40-node cluster.
> >>>> The node is running Ubuntu 16.04.2.
> >>>> dmesg output:
> >>>> [Mon Apr 24 17:21:38 2017] nrpe invoked oom-killer:
> gfp_mask=0x26000c0, order=2, oom_score_adj=0
> >>>> [Mon Apr 24 17:21:38 2017] nrpe cpuset=/ mems_allowed=0
> >>>> [Mon Apr 24 17:21:38 2017] CPU: 0 PID: 12626 Comm: nrpe Not tainted
> 4.4.0-62-generic #83-Ubuntu
> >>>> [Mon Apr 24 17:21:38 2017] 0000000000000286 00000000fc26b170
> ffff88048bf27af0 ffffffff813f7c63
> >>>> [Mon Apr 24 17:21:38 2017] ffff88048bf27cc8 ffff88082a663c00
> ffff88048bf27b60 ffffffff8120ad4e
> >>>> [Mon Apr 24 17:21:38 2017] ffff88087781a870 ffff88087781a860
> ffffea0011285a80 0000000100000001
> >>>> [Mon Apr 24 17:21:38 2017] Call Trace:
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff813f7c63>] dump_stack+0x63/0x90
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8120ad4e>]
> dump_header+0x5a/0x1c5
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff811926c2>]
> oom_kill_process+0x202/0x3c0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81192ae9>]
> out_of_memory+0x219/0x460
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198a5d>]
> __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198e56>]
> __alloc_pages_nodemask+0x286/0x2a0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81198f0b>]
> alloc_kmem_pages_node+0x4b/0xc0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8107ea5e>]
> copy_process+0x1be/0x1b70
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8122d013>] ?
> __fd_install+0x33/0xe0
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81713d01>] ?
> release_sock+0x111/0x160
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff810805a0>] _do_fork+0x80/0x360
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff8122429c>] ?
> SyS_select+0xcc/0x110
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff81080929>] SyS_clone+0x19/0x20
> >>>> [Mon Apr 24 17:21:38 2017] [<ffffffff818385f2>]
> entry_SYSCALL_64_fastpath+0x16/0x71
> >>>> [Mon Apr 24 17:21:38 2017] Mem-Info:
> >>>> [Mon Apr 24 17:21:38 2017] active_anon:553952 inactive_anon:206987
> isolated_anon:0
> >>>> active_file:3410764 inactive_file:3460179
> isolated_file:0
> >>>> unevictable:4914 dirty:212868 writeback:0
> unstable:0
> >>>> slab_reclaimable:386621
> slab_unreclaimable:31829
> >>>> mapped:6112 shmem:211 pagetables:6178
> bounce:0
> >>>> free:82623 free_pcp:213 free_cma:0
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA free:15880kB min:32kB low:40kB
> high:48kB active_anon:0kB inactive_anon:0k
> >>>> B active_file:0kB inactive_file:0kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:15964kB manag
> >>>> ed:15880kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
> slab_reclaimable:0kB slab_unreclaimable:0kB
> >>>> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB
> local_pcp:0kB free_cma:0kB writeback_tmp:
> >>>> 0kB pages_scanned:0 all_unreclaimable? yes
> >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 1868 31944 31944 31944
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32 free:133096kB min:3948kB
> low:4932kB high:5920kB active_anon:170764kB in
> >>>> active_anon:206296kB active_file:394236kB inactive_file:525288kB
> unevictable:980kB isolated(anon):0kB isolated(
> >>>> file):0kB present:2033596kB managed:1952976kB mlocked:980kB
> dirty:1552kB writeback:0kB mapped:3904kB shmem:724k
> >>>> B slab_reclaimable:502176kB slab_unreclaimable:8916kB
> kernel_stack:1952kB pagetables:1408kB unstable:0kB bounce
> >>>> :0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? no
> >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 30076 30076 30076
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal free:181516kB min:63600kB
> low:79500kB high:95400kB active_anon:2045044
> >>>> kB inactive_anon:621652kB active_file:13248820kB
> inactive_file:13315428kB unevictable:18676kB isolated(anon):0kB
> isolated(file):0kB present:31322112kB managed:30798036kB mlocked:18676kB
> dirty:849920kB writeback:0kB mapped:20544kB shmem:120kB
> slab_reclaimable:1044308kB slab_unreclaimable:118400kB kernel_stack:33792kB
> pagetables:23304kB unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> >>>> [Mon Apr 24 17:21:38 2017] lowmem_reserve[]: 0 0 0 0 0
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB
> 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB
> >>>> 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 DMA32: 18416*4kB (UME) 7480*8kB
> (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*
> >>>> 512kB 0*1024kB 0*2048kB 0*4096kB = 133504kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 Normal: 44972*4kB (UMEH) 13*8kB
> (EH) 13*16kB (H) 13*32kB (H) 8*64kB (H) 2*128
> >>>> kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 181384kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=1048576kB
> >>>> [Mon Apr 24 17:21:38 2017] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=2048kB
> >>>> [Mon Apr 24 17:21:38 2017] 6878703 total pagecache pages
> >>>> [Mon Apr 24 17:21:38 2017] 2484 pages in swap cache
> >>>> [Mon Apr 24 17:21:38 2017] Swap cache stats: add 3533870, delete
> 3531386, find 3743168/4627884
> >>>> [Mon Apr 24 17:21:38 2017] Free swap = 14976740kB
> >>>> [Mon Apr 24 17:21:38 2017] Total swap = 15623164kB
> >>>> [Mon Apr 24 17:21:38 2017] 8342918 pages RAM
> >>>> [Mon Apr 24 17:21:38 2017] 0 pages HighMem/MovableOnly
> >>>> [Mon Apr 24 17:21:38 2017] 151195 pages reserved
> >>>> [Mon Apr 24 17:21:38 2017] 0 pages cma reserved
> >>>> [Mon Apr 24 17:21:38 2017] 0 pages hwpoisoned
> >>>> [Mon Apr 24 17:21:38 2017] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> >>>> [Mon Apr 24 17:21:38 2017] [  566]     0   566    15064      460      33       3     1108             0 systemd-journal
> >>>> [Mon Apr 24 17:21:38 2017] [  602]     0   602    23693      182      16       3        0             0 lvmetad
> >>>> [Mon Apr 24 17:21:38 2017] [  613]     0   613    11241      589      21       3      264         -1000 systemd-udevd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1381]   100  1381    25081      440      19       3       25             0 systemd-timesyn
> >>>> [Mon Apr 24 17:21:38 2017] [ 1447]     0  1447     1100      307       7       3        0             0 acpid
> >>>> [Mon Apr 24 17:21:38 2017] [ 1449]     0  1449     7252      374      21       3       47             0 cron
> >>>> [Mon Apr 24 17:21:38 2017] [ 1451]     0  1451    77253      994      19       3       10             0 lxcfs
> >>>> [Mon Apr 24 17:21:38 2017] [ 1483]     0  1483     6511      413      18       3       42             0 atd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1505]     0  1505     7157      286      18       3       36             0 systemd-logind
> >>>> [Mon Apr 24 17:21:38 2017] [ 1508]   104  1508    64099      376      27       4      712             0 rsyslogd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1510]   107  1510    10723      497      25       3       45          -900 dbus-daemon
> >>>> [Mon Apr 24 17:21:38 2017] [ 1521]     0  1521    68970      178      38       3      170             0 accounts-daemon
> >>>> [Mon Apr 24 17:21:38 2017] [ 1526]     0  1526     6548      785      16       3       63             0 smartd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1528]     0  1528    54412      146      31       5     1806             0 snapd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1578]     0  1578     3416      335      11       3       24             0 mdadm
> >>>> [Mon Apr 24 17:21:38 2017] [ 1595]     0  1595    16380      470      35       3      157         -1000 sshd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1610]     0  1610    69295      303      40       4       57             0 polkitd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1618]     0  1618     1306       31       8       3        0             0 iscsid
> >>>> [Mon Apr 24 17:21:38 2017] [ 1619]     0  1619     1431      877       8       3        0           -17 iscsid
> >>>> [Mon Apr 24 17:21:38 2017] [ 1624]     0  1624   126363     8027     122       4    22441             0 glusterd
> >>>> [Mon Apr 24 17:21:38 2017] [ 1688]     0  1688     4884      430      15       3       46             0 irqbalance
> >>>> [Mon Apr 24 17:21:38 2017] [ 1699]     0  1699     3985      348      13       3        0             0 agetty
> >>>> [Mon Apr 24 17:21:38 2017] [ 7001]     0  7001   500631    27874     145       5     3356             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [ 8136]     0  8136   500631    28760     141       5     2390             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [ 9280]     0  9280   533529    27752     135       5     3200             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [12626]   111 12626     5991      420      16       3      113             0 nrpe
> >>>> [Mon Apr 24 17:21:38 2017] [14342]     0 14342   533529    28377     135       5     2176             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14361]     0 14361   534063    29190     136       5     1972             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14380]     0 14380   533529    28104     136       6     2437             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14399]     0 14399   533529    27552     131       5     2808             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14418]     0 14418   533529    29588     138       5     2697             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14437]     0 14437   517080    28671     146       5     2170             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14456]     0 14456   533529    28083     139       5     3359             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14475]     0 14475   533529    28054     134       5     2954             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14494]     0 14494   533529    28594     135       5     2311             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14513]     0 14513   533529    28911     138       5     2833             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14532]     0 14532   533529    28259     134       6     3145             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14551]     0 14551   533529    27875     138       5     2267             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [14570]     0 14570   484716    28247     142       5     2875             0 glusterfsd
> >>>> [Mon Apr 24 17:21:38 2017] [27646]     0 27646  3697561   202086    2830      17    16528             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] [27655]     0 27655   787371    29588     197       6    25472             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] [27665]     0 27665   689585      605     108       6     7008             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] [29878]     0 29878   193833    36054     241       4    41182             0 glusterfs
> >>>> [Mon Apr 24 17:21:38 2017] Out of memory: Kill process 27646 (glusterfs) score 17 or sacrifice child
> >>>> [Mon Apr 24 17:21:38 2017] Killed process 27646 (glusterfs) total-vm:14790244kB, anon-rss:795040kB, file-rss:13304kB
> >>>> /var/log/glusterfs/glusterd.log
> >>>> [2017-04-24 11:53:51.359603] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
> >>>> What could have gone wrong?
> >>>> regards
> >>>> Amudhan
> >>>>
> >>>> _______________________________________________
> >>>> Gluster-users mailing list
> >>>> Gluster-users at gluster.org
> >>>> http://lists.gluster.org/mailman/listinfo/gluster-users
> >>>
> >>>
> >>>
> >>> --
> >>> Pranith
> >>
> >>
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> http://lists.gluster.org/mailman/listinfo/gluster-users
> >
> >
>
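
For reading the quoted dmesg output: the total_vm and rss columns of the
process table are counted in pages, while the "Killed process" line reports
kB. The small sketch below converts between the two for PID 27646. It assumes
4 kB pages (the usual x86-64 default, not stated in the thread), and the
function name is just illustrative.

```python
PAGE_KB = 4  # assumption: 4 kB pages; not stated anywhere in the thread

def pages_to_kb(pages):
    """Convert a page count from the OOM process table to kB."""
    return pages * PAGE_KB

# Values for PID 27646 (glusterfs) taken from the process table above.
total_vm_pages = 3697561
rss_pages = 202086

# Values from the "Killed process 27646" line, in kB.
killed_total_vm_kb = 14790244
killed_anon_rss_kb = 795040
killed_file_rss_kb = 13304

print(pages_to_kb(total_vm_pages) == killed_total_vm_kb)
print(pages_to_kb(rss_pages) == killed_anon_rss_kb + killed_file_rss_kb)
```

Under that assumption both figures line up: the table row and the kill line
describe the same process, with a resident size of roughly 790 MB.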