[Bugs] [Bug 1464327] glusterfs client crashes when reading large directory

bugzilla at redhat.com bugzilla at redhat.com
Wed Jul 5 15:18:11 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1464327



--- Comment #2 from Csaba Henk <csaba at redhat.com> ---
# Analysis 1/2

The stack trace shown in the bug description indicates a memory corruption, but
the corruption itself occurred earlier. (It's a stack overflow, as we shall see.)

The inode table comes with a hash table of dentries. table->name_hash is the
array of buckets, each of which holds a list of dentry_t objects (dentries).
The crash in the __dentry_grep function occurs because the list pointers are
corrupt in the bucket in which a certain dentry is sought.

(gdb) l
769                     return NULL;
770
771             hash = hash_dentry (parent, name, table->hashsize);
772
773             list_for_each_entry (tmp, &table->name_hash[hash], hash) {
774                     if (tmp->parent == parent && !strcmp (tmp->name, name)) {
775                             dentry = tmp;
776                             break;
777                     }
778             }
(gdb) p tmp
$1 = (dentry_t *) 0xfffffffffffffff0
(gdb) p table->name_hash[hash]
$2 = {next = 0x0, prev = 0x0}
(gdb) p (dentry_t *)((char *)(table->name_hash[hash].next) - sizeof(struct list_head))
$3 = (dentry_t *) 0xfffffffffffffff0

So it can be seen that tmp is the dentry pointer which comes as the first
element in the bucket list at the hash key, and it's invalid, so when its
parent member is accessed, a crash occurs. In a list, the list pointers should
always be valid and form a circle. When an inode table is created in
inode_table_new(), all the bucket lists are properly initialized, so they
should remain valid for the lifetime of the table. That this is not the case
indicates a memory corruption. So the question is where the corruption occurs.
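
The bogus value of tmp follows directly from the list iteration arithmetic:
with a zeroed bucket head, the entry pointer is computed as NULL minus the
offset of the dentry's hash member, which the gdb command above takes to be
sizeof (struct list_head), i.e. 16 bytes on x86_64. A minimal standalone sketch
(not GlusterFS source; dentry_t is reduced to the fields the loop above
actually uses):

#include <stdio.h>
#include <stddef.h>

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

/* reduced stand-in for dentry_t: 'hash' sits after one list_head, matching
   the sizeof (struct list_head) offset used in the gdb command above */
typedef struct {
        struct list_head  inode_list;
        struct list_head  hash;
        char             *name;
        void             *parent;
} dentry_t;

int
main (void)
{
        struct list_head bucket = { NULL, NULL };  /* corrupted: a healthy head points to itself */

        /* what the first iteration of list_for_each_entry (tmp, &bucket, hash) computes */
        dentry_t *tmp = (dentry_t *)((char *)bucket.next - offsetof (dentry_t, hash));

        printf ("%p\n", (void *)tmp);              /* prints 0xfffffffffffffff0, as seen in gdb */

        return 0;
}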

Another interesting stack trace can be provoked if we slightly alter the
reproduction instructions: run the two commands, the for loop for file creation
and the ls for listing the files, simultaneously (maybe starting the ls with
some delay). Most of the time the first kind of stack trace will be seen, but
sometimes the following comes up:

(gdb) bt
#0  frame_fill_groups (frame=frame@entry=0x7f44bc040bb0) at fuse-helpers.c:158
#1  0x00007f44fa08f1d6 in get_groups (frame=0x7f44bc040bb0,
priv=0x7f4503405040) at fuse-helpers.c:321
#2  get_call_frame_for_req (state=state@entry=0x7f44d0004aa0) at
fuse-helpers.c:366
#3  0x00007f44fa0977d0 in fuse_unlink_resume (state=0x7f44d0004aa0) at
fuse-bridge.c:1631
#4  0x00007f44fa0915c5 in fuse_resolve_done (state=<optimized out>) at
fuse-resolve.c:663
#5  fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:690
#6  0x00007f44fa0912d8 in fuse_resolve (state=0x7f44d0004aa0) at
fuse-resolve.c:654
#7  0x00007f44fa09160e in fuse_resolve_all (state=<optimized out>) at
fuse-resolve.c:686
#8  0x00007f44fa0908f3 in fuse_resolve_continue
(state=state@entry=0x7f44d0004aa0) at fuse-resolve.c:706
#9  0x00007f44fa090ae7 in fuse_resolve_entry_cbk (frame=<optimized out>,
cookie=<optimized out>, this=0x7f45033feef0, op_ret=0, op_errno=0,
inode=0x7f44e6937830,
    buf=0x7f44ec6d6c60, xattr=0x0, postparent=0x7f44ec6d6cd0) at
fuse-resolve.c:76
#10 0x00007f44ef9cd069 in io_stats_lookup_cbk (frame=0x7f44d0065800,
cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0,
inode=0x7f44e6937830,
    buf=0x7f44ec6d6c60, xdata=0x0, postparent=0x7f44ec6d6cd0) at
io-stats.c:2190
#11 0x00007f4502cb14d1 in default_lookup_cbk (frame=frame@entry=0x7f44d0060840,
cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=0,
op_errno=op_errno@entry=0, inode=0x7f44e6937830,
buf=buf@entry=0x7f44ec6d6c60, xdata=0x0,
postparent=postparent@entry=0x7f44ec6d6cd0) at defaults.c:1265
#12 0x00007f44efdf8933 in mdc_lookup (frame=0x7f44bc040bb0, this=<optimized
out>, loc=0x7f44b8d324e0, xdata=<optimized out>) at md-cache.c:1123
#13 0x00007f4502cc5b92 in default_lookup_resume (frame=0x7f44d0060840,
this=0x7f44f001d280, loc=0x7f44b8d324e0, xdata=0x0) at defaults.c:1872
#14 0x00007f4502c55b25 in call_resume (stub=0x7f44b8d32490) at call-stub.c:2508
#15 0x00007f44efbe3957 in iot_worker (data=0x7f44f002c900) at io-threads.c:220
#16 0x00007f4501a92dc5 in start_thread (arg=0x7f44ec6d7700) at
pthread_create.c:308
#17 0x00007f45013d773d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) l
153             char            line[4096];
154             char           *ptr           = NULL;
155             FILE           *fp            = NULL;
156             int             idx           = 0;
157             long int        id            = 0;
158             char           *saveptr       = NULL;
159             char           *endptr        = NULL;
160             int             ret           = 0;
161             int             ngroups       = FUSE_MAX_AUX_GROUPS;
162             gid_t           mygroups[GF_MAX_AUX_GROUPS];

What is interesting about it is that the crash is indicated at a variable
declaration, which by itself doesn't have much "crash potential".

We can explore the scenario further with Electric Fence
(http://elinux.org/Electric_Fence). (In RHEL/CentOS/Fedora it's packaged as
ElectricFence.)

Start glusterfs in gdb, set up libefence for preloading, and then run
glusterfs:

(gdb) set exec-wrapper env  LD_PRELOAD=/usr/lib64/libefence.so
(gdb) run --entry-timeout=0 --gid-timeout=0 --volfile=<VOLFILE> -N --log-file=-
--log-level=INFO <MOUNTPOINT>

Then, performing the reproduction steps (no need to do them in parallel this
time), we'll hit this:

Thread 7 "glusterfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7e87700 (LWP 17452)]
0x00007ffff4eb450c in frame_fill_groups (frame=<error reading variable: Cannot
access memory at address 0x7ffff7e45468>) at fuse-helpers.c:148
148     {
(gdb) bt
#0  0x00007ffff4eb450c in frame_fill_groups (frame=<error reading variable:
Cannot access memory at address 0x7ffff7e45468>) at fuse-helpers.c:148
#1  0x00007ffff4eb4938 in get_groups (priv=0x7ffff4de4df0,
frame=0x7fffedf26f20) at fuse-helpers.c:302
...
(gdb) l
143
144
145     #define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups
from /proc */
146     void
147     frame_fill_groups (call_frame_t *frame)
148     {
149     #if defined(GF_LINUX_HOST_OS)
150             xlator_t       *this          = frame->this;
151             fuse_private_t *priv          = this->private;
152             char            filename[32];

This is a crash with a very similar stack trace, except that now the point of
crash is indicated at the opening brace of the offending function, and its
first argument is an unreadable address. The safeguard mechanisms of libefence
hint that this is the location of the corruption, and "this" can be identified
as the entry to frame_fill_groups, i.e. the point when the runtime sets up the
stack for calling frame_fill_groups. So a stack issue is quite likely at this
point.
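
For illustration only (this is not GlusterFS code, and the deliberately tiny
64 KB stack below is an artificial setting, not a claim about glusterfs's
actual thread stack sizes): a thread whose remaining stack is smaller than a
large local array faults as soon as the called function touches its freshly
grown frame, which is why the fault is reported right at function entry rather
than at any particular statement.

#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define GF_MAX_AUX_GROUPS 65535                        /* as in glusterfs */

static void
big_frame (void)
{
        unsigned int groups[GF_MAX_AUX_GROUPS];        /* ~256 KB of locals */

        memset (groups, 0, sizeof (groups));           /* faults if the stack can't hold the frame */
        printf ("groups[0] = %u\n", groups[0]);
}

static void *
worker (void *arg)
{
        (void) arg;
        big_frame ();                                  /* expected to SIGSEGV in here */
        return NULL;
}

int
main (void)
{
        pthread_t      tid;
        pthread_attr_t attr;

        pthread_attr_init (&attr);
        pthread_attr_setstacksize (&attr, 64 * 1024);  /* deliberately tiny thread stack */

        pthread_create (&tid, &attr, worker, NULL);
        pthread_join (tid, NULL);

        return 0;
}

(Compile with -pthread.)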

Let's look further into frame_fill_groups!

#define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from
/proc */
void
frame_fill_groups (call_frame_t *frame)
{
#if defined(GF_LINUX_HOST_OS)
        xlator_t       *this          = frame->this;
        fuse_private_t *priv          = this->private;
        char            filename[32];
        char            line[4096];
        char           *ptr           = NULL;
        FILE           *fp            = NULL;
        int             idx           = 0;
        long int        id            = 0;
        char           *saveptr       = NULL;
        char           *endptr        = NULL;
        int             ret           = 0;
        int             ngroups       = FUSE_MAX_AUX_GROUPS;
        gid_t           mygroups[GF_MAX_AUX_GROUPS];

        if (priv->resolve_gids) {

There is one thing allocated on the stack that's bigger than a scalar: the
mygroups array. How big is it?

/* GlusterFS's maximum supported Auxiliary GIDs */
#define GF_MAX_AUX_GROUPS   65535

(gdb) p sizeof(gid_t)
$26 = 4

So an array of 64k integers is allocated on the stack, which means a buffer of
256 KB. That's a likely culprit for a stack overflow. The hypothesis can be
quickly tested by replacing the stack allocation with a heap one, and doing so
eliminates the crash and also makes Electric Fence happy.
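
For illustration, a minimal sketch of the kind of change used for that test
(this is not the actual patch; the real code would likely use GlusterFS's
GF_CALLOC/GF_FREE wrappers and keep the rest of frame_fill_groups unchanged):

#include <stdlib.h>
#include <sys/types.h>

#define GF_MAX_AUX_GROUPS 65535

static void
fill_groups_sketch (void)
{
        /* before: gid_t mygroups[GF_MAX_AUX_GROUPS];   ~256 KB on the stack */
        gid_t *mygroups = calloc (GF_MAX_AUX_GROUPS, sizeof (*mygroups));

        if (!mygroups)
                return;

        /* ... the existing /proc parsing and group filling of
           frame_fill_groups would operate on the heap buffer here ... */

        free (mygroups);
}

int
main (void)
{
        fill_groups_sketch ();
        return 0;
}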

Notes:

- The "gid_t mygroups[GF_MAX_AUX_GROUPS]" pattern occurs at a few other places
too, it's not clear
  why at this location did the stack overflow occur.
- It's also not clear why exactly the issue occurs with this configuration
(io-threads and md-cache
  and the {entry,gid}_timeout=0 settings).
- The "culprit" is change I7ede90d0e41bcf55755cced5747fa0fb1699edb2
  (https://review.gluster.org/#/q/I7ede90d0e41bcf55755cced5747fa0fb1699edb2),
which is present in
  GlusterFS 3.8.0 and also backported to 3.6 and 3.7 branches.
- It's also not clear at this point what's the relationship with the stack
overflow and the
  corruption in the inode table. In the followup comment we'll explore that
further.
