[Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory

John Brunelle john_brunelle at harvard.edu
Thu Jun 13 19:38:29 UTC 2013


Hello,

We're having an issue with our distributed gluster filesystem:

* gluster 3.3.1 servers and clients
* distributed volume -- 69 bricks (4.6T each) split evenly across 3 nodes
* xfs backend
* nfs clients
* nfs.enable-ino32: On

* servers: CentOS 6.3, 2.6.32-279.14.1.el6.centos.plus.x86_64
* clients: CentOS 5.7, 2.6.18-274.12.1.el5

We have a directory containing 3,343 subdirectories.  On some clients,
ls lists only a subset of the directories (a different number on
different clients).  On others, ls gets stuck in a getdents loop and
consumes more and more memory until it hits ENOMEM.  On yet others, it
works fine.  Remounting or dropping caches on the affected clients
makes the problem go away temporarily, but it eventually comes back.
The issue sounds a lot like bug #838784, but we are using xfs on the
backend, and this seems like more of a client-side issue.
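
To take ls and the libc readdir wrapper out of the picture, I've been
watching the syscall directly with a small C sketch along these lines
(the 64 KiB buffer and the 100,000-entry cap are arbitrary choices;
3,343 is the entry count we expect):

/* Minimal sketch: read a directory with raw getdents64 and report
 * batch sizes and offset cookies, bailing out before a looping
 * directory can drive us to ENOMEM. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Kernel dirent layout for getdents64; glibc doesn't expose it. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;    /* offset cookie handed back by the fs */
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[64 * 1024];     /* arbitrary batch size */
    long total = 0;

    for (;;) {
        long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (n < 0) { perror("getdents64"); return 1; }
        if (n == 0) break;   /* clean end-of-directory */

        int64_t last_off = 0;
        for (long pos = 0; pos < n; ) {
            struct linux_dirent64 *d =
                (struct linux_dirent64 *)(buf + pos);
            last_off = d->d_off;
            total++;
            pos += d->d_reclen;
        }
        printf("batch: %ld bytes, %ld entries so far, last d_off %lld\n",
               n, total, (long long)last_off);

        if (total > 100000) { /* far beyond the 3,343 we expect */
            fprintf(stderr, "looks like a loop; giving up\n");
            close(fd);
            return 2;
        }
    }
    printf("done: %ld entries\n", total);
    close(fd);
    return 0;
}

Running this on a good client versus a bad one should make it obvious
whether the d_off cookies coming back ever repeat, which would explain
both the short listings and the getdents loop.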

But we are also getting some page allocation failures on the server
side, e.g. the stack trace below.  These are nearly identical to bug
#842206 and bug #767127.  I'm trying to sort out whether these are
related to the above issue or are just recoverable nic driver
GFP_ATOMIC kmalloc failures, as suggested in the comments on those
bugs.  Slab allocations for dentry, xfs_inode, fuse_inode,
fuse_request, etc. are all at ~100% active, and the total number
appears to be monotonically growing.  Overall memory looks healthy
(2/3 is buffers/cache, almost no swap is used).  I'd need some help to
determine whether memory is overly fragmented, but looking at
pagetypeinfo and zoneinfo it doesn't appear so to me, and the failures
are order:1 anyway.
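
For the fragmentation question, here's a rough sketch of the per-order
summary I've been eyeballing; it just parses /proc/buddyinfo, assuming
the usual "Node N, zone NAME c0 c1 ..." layout:

/* Rough sketch: print free block counts per order from
 * /proc/buddyinfo.  Plenty of free order>=1 blocks at the time the
 * order:1 GFP_ATOMIC allocations fail would point away from real
 * fragmentation. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (!f) { perror("fopen /proc/buddyinfo"); return 1; }

    char line[512], zone[32];
    while (fgets(line, sizeof(line), f)) {
        int node, offset;
        if (sscanf(line, "Node %d, zone %31s%n",
                   &node, zone, &offset) < 2)
            continue;

        /* Remainder of the line is one free-block count per order,
         * starting at order 0. */
        const char *p = line + offset;
        long count;
        int order = 0, used;
        printf("node %d zone %-8s:", node, zone);
        while (sscanf(p, "%ld%n", &count, &used) == 1) {
            printf(" order%d=%ld", order++, count);
            p += used;
        }
        putchar('\n');
    }
    fclose(f);
    return 0;
}

If the order>=1 counts stay comfortably nonzero when the failures
fire, that would suggest transient GFP_ATOMIC pressure in the
interrupt path rather than genuine fragmentation.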

Any suggestions for what might be the problem here?

Thanks,

John

Jun 13 09:41:18 myhost kernel: glusterfsd: page allocation failure. order:1, mode:0x20
Jun 13 09:41:18 myhost kernel: Pid: 20498, comm: glusterfsd Not tainted 2.6.32-279.14.1.el6.centos.plus.x86_64 #1
Jun 13 09:41:18 myhost kernel: Call Trace:
Jun 13 09:41:18 myhost kernel: <IRQ>  [<ffffffff8112790f>] ? __alloc_pages_nodemask+0x77f/0x940
Jun 13 09:41:18 myhost kernel: [<ffffffff81162382>] ? kmem_getpages+0x62/0x170
Jun 13 09:41:18 myhost kernel: [<ffffffff81162f9a>] ? fallback_alloc+0x1ba/0x270
Jun 13 09:41:18 myhost kernel: [<ffffffff811629ef>] ? cache_grow+0x2cf/0x320
Jun 13 09:41:18 myhost kernel: [<ffffffff81162d19>] ? ____cache_alloc_node+0x99/0x160
Jun 13 09:41:18 myhost kernel: [<ffffffff81163afb>] ? kmem_cache_alloc+0x11b/0x190
Jun 13 09:41:18 myhost kernel: [<ffffffff81435298>] ? sk_prot_alloc+0x48/0x1c0
Jun 13 09:41:18 myhost kernel: [<ffffffff81435562>] ? sk_clone+0x22/0x2e0
Jun 13 09:41:18 myhost kernel: [<ffffffff814833a6>] ? inet_csk_clone+0x16/0xd0
Jun 13 09:41:18 myhost kernel: [<ffffffff8149c383>] ? tcp_create_openreq_child+0x23/0x450
Jun 13 09:41:18 myhost kernel: [<ffffffff81499bed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
Jun 13 09:41:18 myhost kernel: [<ffffffff8149c126>] ? tcp_check_req+0x226/0x460
Jun 13 09:41:18 myhost kernel: [<ffffffff81437087>] ? __kfree_skb+0x47/0xa0
Jun 13 09:41:18 myhost kernel: [<ffffffff8149960b>] ? tcp_v4_do_rcv+0x35b/0x430
Jun 13 09:41:18 myhost kernel: [<ffffffff8149ae4e>] ? tcp_v4_rcv+0x4fe/0x8d0
Jun 13 09:41:18 myhost kernel: [<ffffffff81432f6c>] ? sk_reset_timer+0x1c/0x30
Jun 13 09:41:18 myhost kernel: [<ffffffff81478add>] ? ip_local_deliver_finish+0xdd/0x2d0
Jun 13 09:41:18 myhost kernel: [<ffffffff81478d68>] ? ip_local_deliver+0x98/0xa0
Jun 13 09:41:18 myhost kernel: [<ffffffff8147822d>] ? ip_rcv_finish+0x12d/0x440
Jun 13 09:41:18 myhost kernel: [<ffffffff814787b5>] ? ip_rcv+0x275/0x350
Jun 13 09:41:18 myhost kernel: [<ffffffff81441deb>] ? __netif_receive_skb+0x49b/0x6f0
Jun 13 09:41:18 myhost kernel: [<ffffffff8149813a>] ? tcp4_gro_receive+0x5a/0xd0
Jun 13 09:41:18 myhost kernel: [<ffffffff81444068>] ? netif_receive_skb+0x58/0x60
Jun 13 09:41:18 myhost kernel: [<ffffffff81444170>] ? napi_skb_finish+0x50/0x70
Jun 13 09:41:18 myhost kernel: [<ffffffff814466a9>] ? napi_gro_receive+0x39/0x50
Jun 13 09:41:18 myhost kernel: [<ffffffffa01303b4>] ? igb_poll+0x864/0xb00 [igb]
Jun 13 09:41:18 myhost kernel: [<ffffffff810606ec>] ? rebalance_domains+0x3cc/0x5a0
Jun 13 09:41:18 myhost kernel: [<ffffffff814467c3>] ? net_rx_action+0x103/0x2f0
Jun 13 09:41:18 myhost kernel: [<ffffffff81096523>] ? hrtimer_get_next_event+0xc3/0x100
Jun 13 09:41:18 myhost kernel: [<ffffffff81073f61>] ? __do_softirq+0xc1/0x1e0
Jun 13 09:41:18 myhost kernel: [<ffffffff810dbb70>] ? handle_IRQ_event+0x60/0x170
Jun 13 09:41:18 myhost kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
Jun 13 09:41:18 myhost kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
Jun 13 09:41:18 myhost kernel: [<ffffffff81073d45>] ? irq_exit+0x85/0x90
Jun 13 09:41:18 myhost kernel: [<ffffffff8150d505>] ? do_IRQ+0x75/0xf0
Jun 13 09:41:18 myhost kernel: [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11


