[Gluster-users] incomplete listing of a directory, sometimes getdents loops until out of memory
jdarcy at redhat.com
Fri Jun 14 16:01:00 UTC 2013
On 06/13/2013 03:38 PM, John Brunelle wrote:
> We have a directory containing 3,343 subdirectories. On some
> clients, ls lists only a subset of the directories (a different
> amount on different clients). On others, ls gets stuck in a getdents
> loop and consumes more and more memory until it hits ENOMEM. On yet
> others, it works fine. Having the bad clients remount or drop caches
> makes the problem temporarily go away, but eventually it comes back.
> The issue sounds a lot like bug #838784, but we are using xfs on the
> backend, and this seems like more of a client issue.
The fact that drop_caches makes it go away temporarily suggests to me
that something's going on in FUSE. The reference to #838784 might also
be significant even though you're not using ext4. Even the fix for that
still makes some assumptions about how certain directory-entry fields
are used and might still be sensitive to changes in that usage by the
local FS or by FUSE. That might explain both skipping and looping, as
you say you've seen. Would it be possible for you to compile and run
the attached program on one of the affected directories so we can see
what d_off values are involved?
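
The attachment itself is not preserved in this archive. As a rough stand-in, and explicitly not the original program, a minimal C sketch along the following lines would print the d_off of each entry in a directory. It assumes Linux with glibc, where struct dirent exposes d_off. The function name print_d_offs and its max_entries cap are hypothetical; the cap is there only so that a client stuck in the looping case stops instead of running until it is out of memory.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/*
 * Print d_off (in hex) and the name of each entry in `path` to `out`.
 * Stops after `max_entries` entries so that a directory stream that
 * loops forever cannot run unbounded.
 * Returns the number of entries printed, or -1 on error.
 * Hypothetical helper, not the program attached to the original mail.
 */
long print_d_offs(const char *path, FILE *out, long max_entries)
{
    assert(path != NULL && out != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    int err = 0;
    while (count < max_entries) {
        errno = 0;                      /* lets us tell end-of-dir from error */
        struct dirent *ent = readdir(dir);
        if (ent == NULL) {
            err = errno;                /* 0 means a clean end of directory */
            break;
        }
        /* d_off is the opaque "position of the next entry" cookie. */
        fprintf(out, "0x%016llx  %s\n",
                (unsigned long long)ent->d_off, ent->d_name);
        count++;
    }

    closedir(dir);
    return err ? -1 : count;
}
```

Calling print_d_offs(argv[1], stdout, 100000) from a small main() on both a good and a bad client, then comparing the two outputs, may show whether any d_off value repeats or whether entries go missing.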
> But we are also getting some page allocation failures on the server
> side, e.g. the stack trace below. These are nearly identical to
> bug #842206 and bug #767127. I'm trying to sort out if these are
> related to the above issue or just recoverable NIC driver GFP_ATOMIC
> kmalloc failures as suggested in the comments. Slab allocations for
> dentry, xfs_inode, fuse_inode, fuse_request, etc. are all at ~100%
> active, and the total number appears to be monotonically growing.
> Overall memory looks healthy (2/3 is buffers/cache, almost no swap is
> used). I'd need some help to determine if the memory is overly
> fragmented or not, but looking at pagetypeinfo and zoneinfo it
> doesn't appear so to me, and the failures are order:1 anyways.
This one definitely seems like one of those "innocent victim" kind of
things where the real problem is in the network code and we just happen
to be the app that's running.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 702 bytes
Desc: not available