[Gluster-devel] glusterd stuck for glusterfs with version 3.12.15
Raghavendra Gowdappa
rgowdapp at redhat.com
Thu Apr 25 06:07:23 UTC 2019
On Mon, Apr 15, 2019 at 12:52 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:
> Hi,
>
> The reason why I moved event_handled to the end of socket_event_poll_in is
> that if event_handled is called before rpc_transport_pollin_destroy, it
> allows another round of event_dispatch_epoll_handler to run before
> rpc_transport_pollin_destroy. When that later poll-in reaches
> rpc_transport_pollin_destroy, there is a chance that pollin->iobref has
> already been destroyed by the first one (and, by the way, there is no lock
> destroy for iobref->lock in iobref_destroy). That can cause the hang in
> “LOCK (&iobref->lock);”.
>
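
The ordering described in the quoted paragraph can be modeled in isolation
with plain epoll and EPOLLONESHOT (a minimal, self-contained sketch, not
glusterfs code; rearm(), handle_pollin() and per_event_state are made-up
names). rearm() plays a role loosely analogous to event_handled(): once it
runs, another poll worker may be handed the same fd, so any per-event state
has to be torn down before re-arming, which is the ordering proposed above.

/* Minimal model of the ordering discussed above -- NOT glusterfs code. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

struct per_event_state {          /* stands in for the pollin object      */
        char    buf[512];
        ssize_t len;
};

static void rearm (int epfd, int fd)
{
        struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT,
                                  .data.fd = fd };

        /* After this call another worker may be woken up for `fd`. */
        if (epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &ev) == -1)
                perror ("epoll_ctl (MOD)");
}

static void handle_pollin (int epfd, int fd)
{
        struct per_event_state *st = calloc (1, sizeof (*st));

        st->len = read (fd, st->buf, sizeof (st->buf));
        if (st->len > 0)
                printf ("got %zd bytes: %.*s\n", st->len, (int)st->len,
                        st->buf);

        free (st);          /* tear down per-event state first ...        */
        rearm (epfd, fd);   /* ... then allow a new event to be delivered */
}

int main (void)
{
        int                sv[2];
        int                epfd = epoll_create1 (0);
        struct epoll_event ev   = { .events = EPOLLIN | EPOLLONESHOT };
        struct epoll_event out;

        socketpair (AF_UNIX, SOCK_STREAM, 0, sv);
        ev.data.fd = sv[0];
        epoll_ctl (epfd, EPOLL_CTL_ADD, sv[0], &ev);

        write (sv[1], "hello", 5);

        if (epoll_wait (epfd, &out, 1, 1000) == 1)
                handle_pollin (epfd, out.data.fd);

        return 0;
}
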
But priv->incoming.iobref (from which pollin->iobref is initialized) is set
to NULL in __socket_proto_state_machine:

        if (in->record_state == SP_STATE_COMPLETE) {
                in->record_state = SP_STATE_NADA;
                __socket_reset_priv (priv);
        }

And since pollin is a freshly allocated object, only one instance of
socket_event_poll_in is aware of it; in other words, multiple instances of
socket_event_poll_in get different pollin objects. So the only way
pollin->iobref could be shared across multiple invocations of
socket_event_poll_in is through the common shared object
priv->incoming.iobref. But that too is sanitized by the time
__socket_proto_state_machine completes, and __socket_proto_state_machine is
executed under lock. So I don't see how two different concurrent codepaths
can get hold of the same iobref.
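
For reference, the ownership model this argument relies on can be sketched
with a simplified reference count (a stand-in written for this mail, not the
real iobref_ref()/iobref_unref() from iobuf.c; xref and its helpers are
invented names): as long as every codepath that keeps the object around
holds its own reference, only the last unref tears it down, so two
concurrent poll-ins can only collide on one iobref if one of them uses it
without holding a reference of its own.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct xref {                    /* simplified stand-in for an iobref */
        pthread_mutex_t lock;
        int             ref;
};

static struct xref *xref_new (void)
{
        struct xref *x = calloc (1, sizeof (*x));

        pthread_mutex_init (&x->lock, NULL);
        x->ref = 1;                     /* creator owns one reference */
        return x;
}

static struct xref *xref_ref (struct xref *x)
{
        pthread_mutex_lock (&x->lock);
        x->ref++;                       /* every new owner takes its own ref */
        pthread_mutex_unlock (&x->lock);
        return x;
}

static void xref_unref (struct xref *x)
{
        int ref;

        pthread_mutex_lock (&x->lock);
        ref = --x->ref;
        pthread_mutex_unlock (&x->lock);

        if (ref == 0) {
                /* last owner tears the object down; this is also the only
                 * safe place to destroy the embedded lock */
                pthread_mutex_destroy (&x->lock);
                free (x);
        }
}

int main (void)
{
        struct xref *incoming = xref_new ();         /* e.g. priv->incoming */
        struct xref *mine     = xref_ref (incoming); /* ref for one pollin  */

        xref_unref (incoming);  /* producer drops its reference ...         */
        xref_unref (mine);      /* ... consumer's own ref keeps the object
                                 * alive until it is done with it           */

        printf ("refcounted teardown completed cleanly\n");
        return 0;
}
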
> I found one recent patch:
>
> SHA-1: f747d55a7fd364e2b9a74fe40360ab3cb7b11537
>
>
>
> socket: fix issue on concurrent handle of a socket
>
>
>
> I think the point of that patch is to avoid concurrent handling of the same
> socket at the same time, but after my test with this patch the problem still
> exists. So I think event_handled is still called too early, which allows
> concurrent handling of the same socket to happen; after moving it to the end
> of socket_event_poll_in, this glusterd stuck issue disappeared.
>
> cynthia
>
> From: Raghavendra Gowdappa <rgowdapp at redhat.com>
> Sent: Monday, April 15, 2019 2:36 PM
> To: Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com>
> Cc: gluster-devel at gluster.org
> Subject: Re: glusterd stuck for glusterfs with version 3.12.15
>
>
>
>
>
>
>
> On Mon, Apr 15, 2019 at 11:08 AM Zhou, Cynthia (NSB - CN/Hangzhou) <
> cynthia.zhou at nokia-sbell.com> wrote:
>
> Ok, thanks for your comment!
>
>
>
> cynthia
>
>
>
> From: Raghavendra Gowdappa <rgowdapp at redhat.com>
> Sent: Monday, April 15, 2019 11:52 AM
> To: Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com>
> Cc: gluster-devel at gluster.org
> Subject: Re: glusterd stuck for glusterfs with version 3.12.15
>
>
>
> Cynthia,
>
>
>
> On Mon, Apr 15, 2019 at 8:10 AM Zhou, Cynthia (NSB - CN/Hangzhou) <
> cynthia.zhou at nokia-sbell.com> wrote:
>
> Hi,
>
> I made a patch, and according to my test this glusterd stuck issue
> disappears with it. The only change needed is to move event_handled to the
> end of the socket_event_poll_in function.
>
>
>
> --- a/rpc/rpc-transport/socket/src/socket.c
> +++ b/rpc/rpc-transport/socket/src/socket.c
> @@ -2305,9 +2305,9 @@ socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
>          }
>
> -        if (notify_handled && (ret != -1))
> -                event_handled (ctx->event_pool, priv->sock, priv->idx,
> -                               priv->gen);
>
> @@ -2330,6 +2327,9 @@ socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
>                  }
>                  pthread_mutex_unlock (&priv->notify.lock);
>          }
> +        if (notify_handled && (ret != -1))
> +                event_handled (ctx->event_pool, priv->sock, priv->idx,
> +                               priv->gen);
>
>
>
> Thanks for this tip. Though this helps in fixing the hang, the change has a
> performance impact. Moving event_handled to the end of poll_in means the
> socket is added back for polling of new events only _after_ the rpc msg is
> processed by higher layers (like EC), and higher layers can have significant
> latency when processing the msg. That means the socket stays out of polling
> for longer periods, which decreases throughput (number of msgs read per
> second) and hence performance. However, this experiment definitely indicates
> there is a codepath where event_handled is not called (and hence causes the
> hang). I'll go through this codepath again.
>
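
The throughput concern can be illustrated generically (this is only a sketch
of the decoupling idea with pthreads and invented names like enqueue() and
worker(); it is not how the glusterfs event layer is actually structured): if
the thread that reads the socket hands the decoded message to a worker and
goes back to polling immediately, higher-layer latency no longer keeps the fd
out of the poll set, whereas re-arming only after processing serializes the
two.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QCAP 8

static char             queue[QCAP][64];
static int              qhead, qtail, qlen;
static pthread_mutex_t  qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   qcond = PTHREAD_COND_INITIALIZER;

static void enqueue (const char *msg)
{
        pthread_mutex_lock (&qlock);
        while (qlen == QCAP)                 /* simple bounded queue */
                pthread_cond_wait (&qcond, &qlock);
        snprintf (queue[qtail], sizeof (queue[qtail]), "%s", msg);
        qtail = (qtail + 1) % QCAP;
        qlen++;
        pthread_cond_signal (&qcond);
        pthread_mutex_unlock (&qlock);
}

static void *worker (void *arg)
{
        char msg[64];

        (void)arg;
        for (;;) {
                pthread_mutex_lock (&qlock);
                while (qlen == 0)
                        pthread_cond_wait (&qcond, &qlock);
                memcpy (msg, queue[qhead], sizeof (msg));
                qhead = (qhead + 1) % QCAP;
                qlen--;
                pthread_cond_signal (&qcond);
                pthread_mutex_unlock (&qlock);

                usleep (100000);             /* "higher layer" latency */
                printf ("processed: %s\n", msg);
                if (strcmp (msg, "quit") == 0)
                        return NULL;
        }
}

int main (void)
{
        pthread_t tid;

        pthread_create (&tid, NULL, worker, NULL);

        /* The "poller": it can keep reading and re-arming at full speed
         * because processing happens on the worker thread. */
        enqueue ("msg-1");
        enqueue ("msg-2");
        enqueue ("quit");

        pthread_join (tid, NULL);
        return 0;
}
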
>
>
> Can you check whether patch [1] fixes the issue you are seeing?
>
>
>
> [1] https://review.gluster.org/#/c/glusterfs/+/22566
>
>
>
>
>
> Thanks for that experiment :).
>
>
>
> return ret;
>
> }
>
>
>
> cynthia
>
> From: Zhou, Cynthia (NSB - CN/Hangzhou)
> Sent: Tuesday, April 09, 2019 3:57 PM
> To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
> Cc: gluster-devel at gluster.org
> Subject: RE: glusterd stuck for glusterfs with version 3.12.15
>
>
>
> Can you figure out a possible reason why the iobref is corrupted? Is it
> possible that thread 8 called poll-in and the iobref was released, but the
> lock within it was not properly released (I cannot find any lock destroy
> operation in iobref_destroy), and then thread 9 called
> rpc_transport_pollin_destroy again and got stuck on this lock?
>
> Also, there should not be two threads handling the same socket at the same
> time, although there has been a patch that claims to tackle this issue.
>
>
>
> cynthia
>
>
>
> From: Raghavendra Gowdappa <rgowdapp at redhat.com>
> Sent: Tuesday, April 09, 2019 3:52 PM
> To: Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou at nokia-sbell.com>
> Cc: gluster-devel at gluster.org
> Subject: Re: glusterd stuck for glusterfs with version 3.12.15
>
>
>
>
>
>
>
> On Mon, Apr 8, 2019 at 7:42 AM Zhou, Cynthia (NSB - CN/Hangzhou) <
> cynthia.zhou at nokia-sbell.com> wrote:
>
> Hi glusterfs experts,
>
> Good day!
>
> In my test env, a glusterd stuck issue sometimes happens: glusterd stops
> responding to any gluster commands. When I checked this issue I found that
> glusterd thread 9 and thread 8 are dealing with the same socket. I thought
> the following patch should solve this issue, but after I merged it the issue
> still exists. When I looked into the code, socket_event_poll_in calls
> event_handled before rpc_transport_pollin_destroy; I think this gives
> another poll-in on exactly the same socket a chance to run, and that causes
> this glusterd stuck issue. Also, I find there is no
> LOCK_DESTROY(&iobref->lock) in iobref_destroy; I think it is better to add
> the lock destroy there.
>
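
A rough sketch of the suggestion above, against a simplified structure rather
than the real iobuf.c code (fake_iobref and fake_iobref_destroy are invented
names; in glusterfs terms the added call would be LOCK_DESTROY, which, as far
as I know, maps to pthread_mutex_destroy or pthread_spin_destroy depending on
the build). Note that this only cleans up the lock at teardown; it would not
by itself prevent a freed iobref from being locked afterwards.

#include <pthread.h>
#include <stdlib.h>

struct fake_iobref {
        pthread_mutex_t   lock;       /* plays the role of iobref->lock   */
        void            **iobrefs;    /* plays the role of the iobuf list */
};

static void fake_iobref_destroy (struct fake_iobref *iobref)
{
        free (iobref->iobrefs);

        /* the step being suggested: release the lock's resources before
         * freeing the memory that holds it */
        pthread_mutex_destroy (&iobref->lock);

        free (iobref);
}

int main (void)
{
        struct fake_iobref *iobref = calloc (1, sizeof (*iobref));

        pthread_mutex_init (&iobref->lock, NULL);
        iobref->iobrefs = calloc (16, sizeof (void *));

        fake_iobref_destroy (iobref);
        return 0;
}
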
> Following is the gdb info from when this issue happened. I would like to
> know your opinion on this issue, thanks!
>
>
>
> SHA-1: f747d55a7fd364e2b9a74fe40360ab3cb7b11537
>
>
>
> socket: fix issue on concurrent handle of a socket
>
>
>
>
>
>
>
> *GDB INFO:*
>
> Thread 8 is blocked on pthread_cond_wait, and thread 9 is blocked in
> iobref_unref, I think
>
> Thread 9 (Thread 0x7f9edf7fe700 (LWP 1933)):
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1 0x00007f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0
>
> #2 0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at
> iobuf.c:944
>
> #3 0x00007f9eeafd2f29 in rpc_transport_pollin_destroy
> (pollin=0x7f9ed00452d0) at rpc-transport.c:123
>
> #4 0x00007f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0,
> notify_handled=_gf_true) at socket.c:2322
>
> #5 0x00007f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4,
> data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
>
> #6 0x00007f9eeb2825d4 in event_dispatch_epoll_handler
> (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583
>
> #7 0x00007f9eeb2828ab in event_dispatch_epoll_worker (data=0x180d0c0) at
> event-epoll.c:659
>
> #8 0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
>
> #9 0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6
>
>
>
> Thread 8 (Thread 0x7f9edffff700 (LWP 1932)):
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1 0x00007f9ee9fd2b42 in __pthread_mutex_cond_lock () from
> /lib64/libpthread.so.0
>
> #2 0x00007f9ee9fd44a8 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
>
> #3 0x00007f9ee4fbadab in socket_event_poll_err (this=0x7f9ed0049cc0,
> gen=4, idx=27) at socket.c:1201
>
> #4 0x00007f9ee4fbf99c in socket_event_handler (fd=36, idx=27, gen=4,
> data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2480
>
> #5 0x00007f9eeb2825d4 in event_dispatch_epoll_handler
> (event_pool=0x17feb00, event=0x7f9edfffee84) at event-epoll.c:583
>
> #6 0x00007f9eeb2828ab in event_dispatch_epoll_worker (data=0x180cf20) at
> event-epoll.c:659
>
> #7 0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
>
> #8 0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6
>
>
>
> (gdb) thread 9
>
> [Switching to thread 9 (Thread 0x7f9edf7fe700 (LWP 1933))]
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> (gdb) bt
>
> #0 0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1 0x00007f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0
>
> #2 0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at
> iobuf.c:944
>
> #3 0x00007f9eeafd2f29 in rpc_transport_pollin_destroy
> (pollin=0x7f9ed00452d0) at rpc-transport.c:123
>
> #4 0x00007f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0,
> notify_handled=_gf_true) at socket.c:2322
>
> #5 0x00007f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4,
> data=0x7f9ed0049cc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
>
> #6 0x00007f9eeb2825d4 in event_dispatch_epoll_handler
> (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583
>
> #7 0x00007f9eeb2828ab in event_dispatch_epoll_worker (data=0x180d0c0) at
> event-epoll.c:659
>
> #8 0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0
>
> #9 0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6
>
> (gdb) frame 2
>
> #2 0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at
> iobuf.c:944
>
> 944 iobuf.c: No such file or directory.
>
> (gdb) print *iobref
>
> $1 = {lock = {spinlock = 2, mutex = {__data = {__lock = 2, __count = 222,
> __owner = -2120437760, __nusers = 1, __kind = 8960, __spins = 512,
>
> __elision = 0, __list = {__prev = 0x4000, __next =
> 0x7f9ed00063b000}},
>
> __size =
> "\002\000\000\000\336\000\000\000\000\260\234\201\001\000\000\000\000#\000\000\000\002\000\000\000@
> \000\000\000\000\000\000\000\260c\000О\177", __align = 953482739714}},
> ref = -256, iobrefs = 0xffffffffffffffff, alloced = -1, used = -1}
>
>
>
> Looks like the iobref is corrupted here. It seems to be a use-after-free
> issue. We need to dig into why a freed iobref is being accessed here.
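
One generic way to chase this down (a suggestion on my part, not something
already done in this thread) is to run a reproducer under AddressSanitizer or
valgrind: both report the offending access together with the stacks that
allocated and freed the object. A minimal stand-alone illustration of the
pattern (the use after free is deliberate; the mutex member only mirrors the
embedded iobref->lock, and the file name is just for the example):

/* build with: gcc -g -fsanitize=address -pthread uaf-demo.c */
#include <pthread.h>
#include <stdlib.h>

struct obj {
        pthread_mutex_t lock;   /* mirrors iobref->lock */
        int             ref;
};

int main (void)
{
        struct obj *o = malloc (sizeof (*o));

        pthread_mutex_init (&o->lock, NULL);
        o->ref = 1;

        pthread_mutex_destroy (&o->lock);
        free (o);               /* object torn down by one codepath ...    */

        o->ref--;               /* ... and touched again by another: ASan
                                 * reports heap-use-after-free here, with
                                 * the allocation and free stacks          */
        return 0;
}

(valgrind additionally catches accesses made from inside libpthread itself,
e.g. the LOCK on the freed iobref, since it instruments at runtime rather
than at compile time.)
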
>
>
>
> (gdb) quit
>
> A debugging session is active.
>
>