[Gluster-devel] glusterd stuck for glusterfs with version 3.12.15
Raghavendra Gowdappa
rgowdapp at redhat.com
Mon Apr 15 08:35:54 UTC 2019
On Mon, Apr 15, 2019 at 12:52 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:
> Hi,
>
> The reason why I moved event_handled to the end of socket_event_poll_in is
> that calling event_handled before rpc_transport_pollin_destroy allows
> another round of event_dispatch_epoll_handler to run before
> rpc_transport_pollin_destroy finishes. When the later poll-in then reaches
> rpc_transport_pollin_destroy, there is a chance that pollin->iobref has
> already been destroyed by the first one (and note there is no lock destroy
> for iobref->lock in iobref_destroy). That can leave a thread stuck in
> "LOCK (&iobref->lock);".
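> To illustrate the interleaving I suspect (a sketch only, not the exact code
> path; it assumes both pollin instances end up referencing the same iobref):
>
>     /*  epoll worker 1                        epoll worker 2
>      *  --------------                        --------------
>      *  socket_event_poll_in()
>      *    event_handled()       <-- fd re-armed before cleanup
>      *                                         socket_event_poll_in()
>      *                                           rpc_transport_pollin_destroy()
>      *                                             iobref_unref()  -> iobref freed
>      *    rpc_transport_pollin_destroy()
>      *      iobref_unref()
>      *        LOCK (&iobref->lock)  <-- lock lives in freed memory and can
>      *                                  never be acquired; this thread hangs
>      */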
>
> I found one recent patch:
>
> SHA-1: f747d55a7fd364e2b9a74fe40360ab3cb7b11537
>
>
>
> *socket: fix issue on concurrent handle of a socket*
>
>
>
> I think the point of that patch is to avoid concurrent handling of the same
> socket at the same time, but after testing with this patch the problem still
> exists, so I think event_handled is still called too early and concurrent
> handling of the same socket can still happen,
>
But concurrent handling is required for performance. So, we cannot
serialize it.
> and after moving it to the end of socket_event_poll_in, this glusterd stuck
> issue disappeared.
>
Ideally no data structure should be shared between two instances of pollin
handling. My initial code reading didn't find any issue with the way iobref
is handled, even when a concurrent read happens while the previous message
has not yet been notified. I'll continue to investigate how objects end up
shared across two instances of pollin and will post if I find anything
interesting.
cynthia
>
> *From:* Raghavendra Gowdappa <rgowdapp at redhat.com>
> *Sent:* Monday, April 15, 2019 2:36 PM
> *To:* Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com>
> *Cc:* gluster-devel at gluster.org
> *Subject:* Re: glusterd stuck for glusterfs with version 3.12.15
>
>
>
>
>
>
>
> On Mon, Apr 15, 2019 at 11:08 AM Zhou, Cynthia (NSB - CN/Hangzhou) <
> cynthia.zhou at nokia-sbell.com> wrote:
>
> Ok, thanks for your comment!
>
>
>
> cynthia
>
>
>
> *From:* Raghavendra Gowdappa <rgowdapp at redhat.com>
> *Sent:* Monday, April 15, 2019 11:52 AM
> *To:* Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com>
> *Cc:* gluster-devel at gluster.org
> *Subject:* Re: glusterd stuck for glusterfs with version 3.12.15
>
>
>
> Cynthia,
>
>
>
> On Mon, Apr 15, 2019 at 8:10 AM Zhou, Cynthia (NSB - CN/Hangzhou) <
> cynthia.zhou at nokia-sbell.com> wrote:
>
> Hi,
>
> I made a patch, and according to my test the glusterd stuck issue disappears
> with it. The only change needed is to move event_handled to the end of the
> socket_event_poll_in function.
>
>
>
> --- a/rpc/rpc-transport/socket/src/socket.c
> +++ b/rpc/rpc-transport/socket/src/socket.c
> @@ -2305,9 +2305,9 @@ socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
>          }
>
> -        if (notify_handled && (ret != -1))
> -                event_handled (ctx->event_pool, priv->sock, priv->idx,
> -                               priv->gen);
> @@ -2330,6 +2327,9 @@ socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
>          }
>                  pthread_mutex_unlock (&priv->notify.lock);
>          }
> +        if (notify_handled && (ret != -1))
> +                event_handled (ctx->event_pool, priv->sock, priv->idx,
> +                               priv->gen);
>
>
>
> Thanks for this tip. Though this helps in fixing the hang, the change has a
> performance impact. Moving event_handled to the end of poll_in means the
> socket is added back for polling of new events only _after_ the rpc msg is
> processed by higher layers (like EC), and higher layers can have significant
> latency for processing the msg. That means the socket stays out of polling
> for longer periods, which decreases the throughput (number of msgs read per
> second) and affects performance. However, this experiment definitely
> indicates there is a codepath where event_handled is not called (and hence
> causes the hang). I'll go through this codepath again.
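> For reference, the reason calling event_handled early matters: the epoll
> registration is effectively one-shot, so the fd stays disarmed until
> event_handled re-adds it. A rough illustration (not the actual glusterfs
> code; the helper name is made up):
>
>     #include <sys/epoll.h>
>
>     /* With EPOLLONESHOT the fd is disarmed after each event and must be
>      * re-armed explicitly; event_handled() is what triggers this re-arm.
>      * Delaying it keeps the socket out of epoll until higher layers are
>      * done with the message. */
>     static int
>     rearm_socket (int epfd, int sock, void *data)
>     {
>             struct epoll_event ev = {0};
>
>             ev.events   = EPOLLIN | EPOLLONESHOT;
>             ev.data.ptr = data;
>
>             return epoll_ctl (epfd, EPOLL_CTL_MOD, sock, &ev);
>     }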
>
>
>
> Can you check whether patch [1] fixes the issue you are seeing?
>
>
>
> [1] https://review.gluster.org/#/c/glusterfs/+/22566
>
>
>
>
>
> Thanks for that experiment :).
>
>
>
> return ret;
>
> }
>
>
>
> cynthia
>
> *From:* Zhou, Cynthia (NSB - CN/Hangzhou)
> *Sent:* Tuesday, April 09, 2019 3:57 PM
> *To:* 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
> *Cc:* gluster-devel at gluster.org
> *Subject:* RE: glusterd stuck for glusterfs with version 3.12.15
>
>
>
> Can you figure out some possible reason why the iobref is corrupted? Is it
> possible that thread 8 has handled the poll-in and the iobref has been
> released, but the lock within it was not properly destroyed (as I cannot
> find any lock-destroy operation in iobref_destroy), and then thread 9 called
> rpc_transport_pollin_destroy again and so got stuck on this lock?
>
> Also, there should not be two threads handling the same socket at the same
> time, although there has been a patch claiming to tackle this issue.
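> Judging from the notify.lock in that patch and thread 8's backtrace below
> (socket_event_poll_err blocked in pthread_cond_wait), I assume the
> serialization works roughly like this; illustrative only, the struct and
> function names here are my guess, not the real code:
>
>     #include <pthread.h>
>
>     struct notify_state {
>             pthread_mutex_t lock;
>             pthread_cond_t  cond;
>             int             in_progress;
>     };
>
>     /* poll-in side: mark a notification in flight */
>     static void notify_begin (struct notify_state *n)
>     {
>             pthread_mutex_lock (&n->lock);
>             n->in_progress++;
>             pthread_mutex_unlock (&n->lock);
>     }
>
>     /* poll-in side, once the notification completes: wake any waiter */
>     static void notify_done (struct notify_state *n)
>     {
>             pthread_mutex_lock (&n->lock);
>             if (--n->in_progress == 0)
>                     pthread_cond_broadcast (&n->cond);
>             pthread_mutex_unlock (&n->lock);
>     }
>
>     /* poll-err side: wait for in-flight notifications to drain before
>      * tearing the socket down (this is where thread 8 is waiting) */
>     static void wait_for_notifications (struct notify_state *n)
>     {
>             pthread_mutex_lock (&n->lock);
>             while (n->in_progress > 0)
>                     pthread_cond_wait (&n->cond, &n->lock);
>             pthread_mutex_unlock (&n->lock);
>     }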
>
>
>
> cynthia
>
>
>
> *From:* Raghavendra Gowdappa <rgowdapp at redhat.com>
> *Sent:* Tuesday, April 09, 2019 3:52 PM
> *To:* Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com>
> *Cc:* gluster-devel at gluster.org
> *Subject:* Re: glusterd stuck for glusterfs with version 3.12.15
>
>
>
>
>
>
>
> On Mon, Apr 8, 2019 at 7:42 AM Zhou, Cynthia (NSB - CN/Hangzhou) <
> cynthia.zhou at nokia-sbell.com> wrote:
>
> Hi glusterfs experts,
>
> Good day!
>
> In my test env, a glusterd stuck issue sometimes happens, and glusterd then
> does not respond to any gluster commands. When I checked this issue I found
> that glusterd thread 9 and thread 8 are dealing with the same socket. I
> thought the following patch should solve this issue, however after merging
> it the issue still exists. When I looked into the code, it seems
> socket_event_poll_in calls event_handled before rpc_transport_pollin_destroy;
> I think this gives another poll the chance to run on exactly the same socket
> and causes this glusterd stuck issue. Also, I find there is no
> LOCK_DESTROY(&iobref->lock) in iobref_destroy; I think it is better to add
> the lock destroy there.
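> Roughly what I mean (a sketch only; the real iobref_destroy also has to free
> the iobufs it holds):
>
>     static void
>     iobref_destroy (struct iobref *iobref)
>     {
>             /* ... existing cleanup of the iobufs held by this iobref ... */
>
>             LOCK_DESTROY (&iobref->lock);   /* proposed addition */
>             GF_FREE (iobref);
>     }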
>
> Following is the gdb info from when this issue happened; I would like to
> know your opinion on it, thanks!
>
>
>
> SHA-1: f747d55a7fd364e2b9a74fe40360ab3cb7b11537
>
>
>
> *socket: fix issue on concurrent handle of a socket*
>
>
>
>
>
>
>
> *GDB INFO:*
>
> Thread 8 is blocked in pthread_cond_wait, and thread 9 is blocked in
> iobref_unref, I think.
>
> Thread 9 (Thread 0x7f9edf7fe700 (LWP 1933)):
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1 0x00007f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0
>
> #2 0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at
> iobuf.c:944
>
> #3 0x00007f9eeafd2f29 in rpc_transport_pollin_destroy
> (pollin=0x7f9ed00452d0) at rpc-transport.c:123
>
> #4 0x00007f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0,
> notify_handled=_gf_true) at socket.c:2322
>
> #5 0x00007f9ee4fbf932 in socket_event_handler (*fd=36, idx=27, gen=4,
> data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0*) at socket.c:2471
>
> #6 0x00007f9eeb2825d4 in event_dispatch_epoll_handler
> (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583
>
> #7 0x00007f9eeb2828ab in event_dispatch_epoll_worker (data=0x180d0c0) at
> event-epoll.c:659
>
> #8 0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
>
> #9 0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6
>
>
>
> Thread 8 (Thread 0x7f9edffff700 (LWP 1932)):
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1 0x00007f9ee9fd2b42 in __pthread_mutex_cond_lock () from
> /lib64/libpthread.so.0
>
> #2 0x00007f9ee9fd44a8 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
>
> #3 0x00007f9ee4fbadab in socket_event_poll_err (this=0x7f9ed0049cc0,
> gen=4, idx=27) at socket.c:1201
>
> #4 0x00007f9ee4fbf99c in socket_event_handler (*fd=36, idx=27, gen=4,
> data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0*) at socket.c:2480
>
> #5 0x00007f9eeb2825d4 in event_dispatch_epoll_handler
> (event_pool=0x17feb00, event=0x7f9edfffee84) at event-epoll.c:583
>
> #6 0x00007f9eeb2828ab in event_dispatch_epoll_worker (data=0x180cf20) at
> event-epoll.c:659
>
> #7 0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
>
> #8 0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6
>
>
>
> (gdb) thread 9
>
> [Switching to thread 9 (Thread 0x7f9edf7fe700 (LWP 1933))]
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> (gdb) bt
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1 0x00007f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0
>
> #2 0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at
> iobuf.c:944
>
> #3 0x00007f9eeafd2f29 in rpc_transport_pollin_destroy
> (pollin=0x7f9ed00452d0) at rpc-transport.c:123
>
> #4 0x00007f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0,
> notify_handled=_gf_true) at socket.c:2322
>
> #5 0x00007f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4,
> data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
>
> #6 0x00007f9eeb2825d4 in event_dispatch_epoll_handler
> (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583
>
> #7 0x00007f9eeb2828ab in event_dispatch_epoll_worker (data=0x180d0c0) at
> event-epoll.c:659
>
> #8 0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
>
> #9 0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6
>
> (gdb) frame 2
>
> #2 0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at
> iobuf.c:944
>
> 944 iobuf.c: No such file or directory.
>
> (gdb) print *iobref
>
> $1 = {lock = {spinlock = 2, mutex = {__data = {__lock = 2, __count = 222,
> __owner = -2120437760, __nusers = 1, __kind = 8960, __spins = 512,
>
> __elision = 0, __list = {__prev = 0x4000, __next =
> 0x7f9ed00063b000}},
>
> __size =
> "\002\000\000\000\336\000\000\000\000\260\234\201\001\000\000\000\000#\000\000\000\002\000\000\000@
> \000\000\000\000\000\000\000\260c\000О\177", __align = 953482739714}},
> ref = -256, iobrefs = 0xffffffffffffffff, alloced = -1, used = -1}
>
>
>
> looks like the iobref is corrupted here. It seems to be a use-after-free
> issue. We need to dig into why a freed iobref is being accessed here.
>
>
>
> (gdb) quit
>
> A debugging session is active.
>
>