[Gluster-devel] glusterfsd crash due to page allocation failure

Tue Dec 22 15:40:21 UTC 2015

Pranith,

This issue continues to happen.  If you could provide instructions for 
collecting the statedump, I would be happy to send that information.
I am not sure how to capture a statedump just before the crash, as the 
crash is intermittent.

David

------ Original Message ------
From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
To: "Glomski, Patrick" <patrick.glomski at corvidtec.com>; 
gluster-devel at gluster.org; gluster-users at gluster.org
Cc: "David Robinson" <david.robinson at corvidtec.com>
Sent: 12/21/2015 11:59:33 PM
Subject: Re: [Gluster-devel] glusterfsd crash due to page allocation failure

>Hi Glomski,
>         This is the second time I have heard about memory allocation 
>problems in 3.7.6, but this time on the brick side. Are you able to 
>reproduce this issue? Would it be possible to get statedumps of the 
>brick processes just before they crash?
>
>Pranith
>
>On 12/22/2015 02:25 AM, Glomski, Patrick wrote:
>>Hello,
>>
>>We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started 
>>encountering dmesg page allocation errors (stack trace is appended).
>>
>>It appears that glusterfsd now sometimes fills up the cache completely 
>>and crashes with a page allocation failure. I *believe* it mainly 
>>happens when copying lots of new data to the system, running a 'find', 
>>or similar. Hosts are all Scientific Linux 6.6 and these errors occur 
>>consistently on two separate gluster pools.
>>
>>Has anyone else seen this issue and are there any known fixes for it 
>>via sysctl kernel parameters or other means?
>>
>>Please let me know of any other diagnostic information that would 
>>help.
>>
>>Thanks,
>>Patrick
>>
>>
>>>[1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20
>>>[1458118.134701] Pid: 6010, comm: glusterfsd Not tainted 2.6.32-573.3.1.el6.x86_64 #1
>>>[1458118.134702] Call Trace:
>>>[1458118.134714]  [<ffffffff8113770c>] ? __alloc_pages_nodemask+0x7dc/0x950
>>>[1458118.134728]  [<ffffffffa0321800>] ? mlx4_ib_post_send+0x680/0x1f90 [mlx4_ib]
>>>[1458118.134733]  [<ffffffff81176e92>] ? kmem_getpages+0x62/0x170
>>>[1458118.134735]  [<ffffffff81177aaa>] ? fallback_alloc+0x1ba/0x270
>>>[1458118.134736]  [<ffffffff811774ff>] ? cache_grow+0x2cf/0x320
>>>[1458118.134738]  [<ffffffff81177829>] ? ____cache_alloc_node+0x99/0x160
>>>[1458118.134743]  [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
>>>[1458118.134744]  [<ffffffff81178479>] ? __kmalloc+0x199/0x230
>>>[1458118.134746]  [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
>>>[1458118.134748]  [<ffffffff8146001a>] ? __pskb_pull_tail+0x2aa/0x360
>>>[1458118.134751]  [<ffffffff8146f389>] ? harmonize_features+0x29/0x70
>>>[1458118.134753]  [<ffffffff8146f9f4>] ? dev_hard_start_xmit+0x1c4/0x490
>>>[1458118.134758]  [<ffffffff8148cf8a>] ? sch_direct_xmit+0x15a/0x1c0
>>>[1458118.134759]  [<ffffffff8146ff68>] ? dev_queue_xmit+0x228/0x320
>>>[1458118.134762]  [<ffffffff8147665d>] ? neigh_connected_output+0xbd/0x100
>>>[1458118.134766]  [<ffffffff814abc67>] ? ip_finish_output+0x287/0x360
>>>[1458118.134767]  [<ffffffff814abdf8>] ? ip_output+0xb8/0xc0
>>>[1458118.134769]  [<ffffffff814ab04f>] ? __ip_local_out+0x9f/0xb0
>>>[1458118.134770]  [<ffffffff814ab085>] ? ip_local_out+0x25/0x30
>>>[1458118.134772]  [<ffffffff814ab580>] ? ip_queue_xmit+0x190/0x420
>>>[1458118.134773]  [<ffffffff81137059>] ? __alloc_pages_nodemask+0x129/0x950
>>>[1458118.134776]  [<ffffffff814c0c54>] ? tcp_transmit_skb+0x4b4/0x8b0
>>>[1458118.134778]  [<ffffffff814c319a>] ? tcp_write_xmit+0x1da/0xa90
>>>[1458118.134779]  [<ffffffff81178cbd>] ? __kmalloc_node+0x4d/0x60
>>>[1458118.134780]  [<ffffffff814c3a80>] ? tcp_push_one+0x30/0x40
>>>[1458118.134782]  [<ffffffff814b410c>] ? tcp_sendmsg+0x9cc/0xa20
>>>[1458118.134786]  [<ffffffff8145836b>] ? sock_aio_write+0x19b/0x1c0
>>>[1458118.134788]  [<ffffffff814581d0>] ? sock_aio_write+0x0/0x1c0
>>>[1458118.134791]  [<ffffffff8119169b>] ? do_sync_readv_writev+0xfb/0x140
>>>[1458118.134797]  [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
>>>[1458118.134801]  [<ffffffff8123e92f>] ? selinux_file_permission+0xbf/0x150
>>>[1458118.134804]  [<ffffffff812316d6>] ? security_file_permission+0x16/0x20
>>>[1458118.134806]  [<ffffffff81192746>] ? do_readv_writev+0xd6/0x1f0
>>>[1458118.134807]  [<ffffffff811928a6>] ? vfs_writev+0x46/0x60
>>>[1458118.134809]  [<ffffffff811929d1>] ? sys_writev+0x51/0xd0
>>>[1458118.134812]  [<ffffffff810e88ae>] ? __audit_syscall_exit+0x25e/0x290
>>>[1458118.134816]  [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>>
>>
>>
>
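
For readers of the quoted trace, the first dmesg line carries the key numbers: "order:5" means the kernel was asked for 2^5 physically contiguous pages. The sketch below parses that line and works out the request size. It assumes 4 KiB base pages (the usual x86_64 default); parse_alloc_failure is a hypothetical helper name, not part of any Gluster or kernel tool. The mode value is kept as a raw number because the meaning of its bits depends on the kernel version.

```python
import re

# Assumption: 4 KiB base pages, the usual default on x86_64.
PAGE_SIZE = 4096

# Matches e.g. "glusterfsd: page allocation failure. order:5, mode:0x20"
FAILURE_RE = re.compile(
    r"(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)"
)


def parse_alloc_failure(line):
    """Return process name, order, raw mode and request size, or None."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group(2))
    pages = 1 << order  # an order-N request is 2**N contiguous pages
    return {
        "comm": match.group(1),
        "order": order,
        "mode": int(match.group(3), 16),  # raw GFP mask, not decoded here
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
    }


line = "[1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20"
info = parse_alloc_failure(line)
print(info["comm"], info["pages"], "pages,", info["bytes"] // 1024, "KiB")
# prints: glusterfsd 32 pages, 128 KiB
```

Under that page-size assumption, the failed request was for 128 KiB of contiguous memory, which can fail on a fragmented system even when plenty of memory is free in smaller pieces.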
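
On the question of capturing a statedump before an intermittent crash: one possible approach, offered only as a sketch, is to trigger statedumps at a fixed interval so that a reasonably recent dump exists whenever a brick goes down. The code below wraps the "gluster volume statedump <VOLNAME>" CLI command. The volume name, the interval, and the helper names statedump_command and dump_periodically are illustrative assumptions, and nothing in this thread confirms that periodic dumps would catch this particular failure.

```python
import subprocess
import time


def statedump_command(volume):
    """Build the CLI call that asks the volume's brick processes for a statedump."""
    return ["gluster", "volume", "statedump", volume]


def dump_periodically(volume, interval_s=300, iterations=None,
                      run=subprocess.run, sleep=time.sleep):
    """Trigger a statedump every interval_s seconds.

    iterations=None loops until interrupted. run and sleep are injectable
    so the loop can be exercised without a Gluster installation.
    Returns the number of dumps requested.
    """
    count = 0
    while iterations is None or count < iterations:
        # check=False: a failed dump request should not stop the loop.
        run(statedump_command(volume), check=False)
        count += 1
        if iterations is None or count < iterations:
            sleep(interval_s)
    return count


# Example (hypothetical volume name; run as root on a server in the pool):
#   dump_periodically("myvolume", interval_s=300)
```

Dump files are typically written under /var/run/gluster on each brick host unless the statedump path has been changed. They accumulate quickly at short intervals, so old dumps would need to be pruned.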