[Bugs] [Bug 1180231] New: glusterfs-fuse: Crash due to race in FUSE notify when multiple epoll threads invoke the routine

Thu Jan 8 16:39:12 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1180231

            Bug ID: 1180231
           Summary: glusterfs-fuse: Crash due to race in FUSE notify when
                    multiple epoll threads invoke the routine
           Product: GlusterFS
           Version: mainline
         Component: fuse
          Assignee: bugs at gluster.org
          Reporter: srangana at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com

Description of problem:
While running the regression test suite for the multi-thread epoll patch
(http://review.gluster.org/#/c/3842/), crashes were observed in glusterfs, as
shown below. They mostly occur when the volume has just been mounted via FUSE
and a graph change event is triggered, though at times they occur even when no
events are generated.

Core details:
Core was generated by `glusterfs --volfile-server=127.1.1.1 --volfile-id=patchy /mnt/glusterfs/0'.
Program terminated with signal 11, Segmentation fault.
#0  get_fuse_state (this=0x1f9c8e0, finh=0x7f3c28000900) at fuse-helpers.c:127
127                     active_subvol->winds++;

(gdb) bt
#0  get_fuse_state (this=0x1f9c8e0, finh=0x7f3c28000900) at fuse-helpers.c:127
#1  0x00007f3c472ef6cd in fuse_getattr (this=0x1f9c8e0, finh=0x7f3c28000900, msg=<value optimized out>) at fuse-bridge.c:846
#2  0x00007f3c472f9950 in fuse_thread_proc (data=0x1f9c8e0) at fuse-bridge.c:4899
#3  0x00007f3c4f0b19d1 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f3c4ea1b9dd in clone () from /lib64/libc.so.6

(gdb) p active_subvol 
$1 = (xlator_t *) 0x0

<<edited output below>>
(gdb) p *(fuse_private_t *)state->this->private 
$7 = {fd = 8, volfile = 0x0, mount_point = 0x1f9db20 "/mnt/glusterfs/0",
fuse_thread = 139896664205056, fuse_thread_started = 1 '\001',<...> event_recvd
= 1 '\001', init_recvd = 1 '\001', <...> next_graph = 0x7f3c34000900,
active_subvol = 0x0, <...> use_readdirp = _gf_true}

How reproducible:
Happens fairly often on the regression runs, but is quite difficult to
reproduce on other machines.

Steps to Reproduce:
Regression runs for the above review failed, specifically in test case
bug-948686.t

Additional info:
The root cause is as follows:

1) As there are 2 (or more) epoll threads, the child-up notification for the
same graph (graph ID 0) is delivered in both threads.

2) In notify, both threads race to call fuse_graph_setup. One succeeds in
setting graph->used, and the other bails out on seeing that the graph is
already used.

3) The thread that bails out then starts fuse_thread_proc, while the other
thread is still updating the fuse private structure.

4) As fuse_thread_proc has started but the other thread has not completed
fuse_graph_setup, priv->next_graph is still NULL at this point, so
fuse_graph_sync does not promote it to active_subvol.

As a result, when the call is processed, active_subvol is NULL and we crash.
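
The interleaving described above can be replayed deterministically in a small
model. This is only a sketch: the toy_* types and functions are hypothetical
stand-ins invented for illustration, not the real fuse_graph_setup,
fuse_graph_sync, or fuse_private_t, and the split of setup into a "claim" step
and a "publish" step is an assumption based on the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the graph and the fuse private state. */
struct toy_graph { bool used; };
struct toy_priv {
    struct toy_graph *next_graph;
    struct toy_graph *active_subvol;
};

/* First half of setup: claim the graph. Returns false if already claimed. */
static bool toy_claim(struct toy_graph *g)
{
    if (g->used)
        return false;
    g->used = true;
    return true;
}

/* Second half of setup: make the graph visible to the sync step. */
static void toy_publish(struct toy_priv *p, struct toy_graph *g)
{
    p->next_graph = g;
}

/* Model of graph sync: promote next_graph to active_subvol, if there is one. */
static void toy_sync(struct toy_priv *p)
{
    if (p->next_graph == NULL)
        return;
    p->active_subvol = p->next_graph;
    p->next_graph = NULL;
}

/*
 * Replays the two-thread interleaving in a single thread.
 * publish_first == false: thread 2 bails out and request processing starts
 *                         before thread 1 has published (the race).
 * publish_first == true:  thread 1 publishes before thread 2 can proceed.
 * Returns whether active_subvol was set when the first request was handled.
 */
static bool toy_replay(bool publish_first)
{
    struct toy_graph g = { false };
    struct toy_priv p = { NULL, NULL };

    bool winner = toy_claim(&g);        /* thread 1 marks the graph used */
    if (publish_first)
        toy_publish(&p, &g);
    bool loser = toy_claim(&g);         /* thread 2 sees used and bails out */

    toy_sync(&p);                       /* request processing starts, syncs */
    bool active_set = (p.active_subvol != NULL);

    if (!publish_first)
        toy_publish(&p, &g);            /* thread 1 publishes too late */

    (void)winner;
    (void)loser;
    return active_set;
}
```

In this model, toy_replay(false) ends with next_graph set and active_subvol
still NULL, which is the same shape as the fuse private dump above.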

Other areas where the graph could become NULL were explored (by Du and myself),
and the graph can never be NULL there, so this happens at startup itself due to
the race above.

The resolution looks to be widening the critical section so that it covers
both marking the graph as used and setting it into next_graph, so that
fuse_graph_sync sees the right state.
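
A minimal sketch of that idea, assuming a single mutex guards both the used
flag and next_graph. The sk_* names are hypothetical and are not the glusterfs
code; the actual patch may take a different shape. The point shown is that a
thread can only observe used == true after the winning thread has also
published next_graph, because both happen inside one critical section.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the graph and the fuse private state. */
struct sk_graph { bool used; };
struct sk_priv {
    pthread_mutex_t lock;
    struct sk_graph *next_graph;
    struct sk_graph *active_subvol;
};
struct sk_ctx {
    struct sk_priv *priv;
    struct sk_graph *graph;
    bool saw_active;                /* result observed by this notifier */
};

/* Claim the graph AND publish it under the same lock. */
static bool sk_graph_setup(struct sk_priv *p, struct sk_graph *g)
{
    bool claimed = false;

    pthread_mutex_lock(&p->lock);
    if (!g->used) {
        g->used = true;
        p->next_graph = g;
        claimed = true;
    }
    pthread_mutex_unlock(&p->lock);
    return claimed;
}

/* Promote next_graph if present; report whether active_subvol is set. */
static bool sk_graph_sync(struct sk_priv *p)
{
    bool active;

    pthread_mutex_lock(&p->lock);
    if (p->next_graph != NULL) {
        p->active_subvol = p->next_graph;
        p->next_graph = NULL;
    }
    active = (p->active_subvol != NULL);
    pthread_mutex_unlock(&p->lock);
    return active;
}

/* Each notifier runs setup, then syncs as if request processing had started. */
static void *sk_notify(void *arg)
{
    struct sk_ctx *c = arg;

    sk_graph_setup(c->priv, c->graph);
    c->saw_active = sk_graph_sync(c->priv);
    return NULL;
}

/* Runs two racing notifiers; true if both saw a non-NULL active_subvol. */
static bool sk_run_once(void)
{
    struct sk_graph g = { false };
    struct sk_priv p;
    struct sk_ctx c[2];
    pthread_t t[2];
    int started = 0;

    if (pthread_mutex_init(&p.lock, NULL) != 0)
        return false;
    p.next_graph = NULL;
    p.active_subvol = NULL;

    for (int i = 0; i < 2; i++) {
        c[i].priv = &p;
        c[i].graph = &g;
        c[i].saw_active = false;
        if (pthread_create(&t[i], NULL, sk_notify, &c[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);

    pthread_mutex_destroy(&p.lock);
    return started == 2 && c[0].saw_active && c[1].saw_active;
}
```

Calling sk_run_once() repeatedly should return true regardless of which
thread wins the claim, since the losing thread cannot get past setup until the
winner has left the critical section.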

I added log messages for when graph_sync is called, when graph_setup is done,
and so on. They show that graph_sync is called while 2 graph_setup calls are
racing, which causes the issue.

I did not verify, though, that one of the threads exited on detecting
graph->used as true. That seems the most likely explanation, as there are
multiple epoll threads and notify would race.
