<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jan 29, 2019 at 12:11 PM Zhou, Cynthia (NSB - CN/Hangzhou) <<a href="mailto:cynthia.zhou@nokia-sbell.com">cynthia.zhou@nokia-sbell.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="ZH-CN">
<div class="gmail-m_5189538375006751569WordSection1">
<p class="MsoNormal"><span lang="EN-US">Hi, <u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">We are using glusterfs version 3.12 with 3 bricks. I find that occasionally, after rebooting all 3 SN nodes simultaneously, the glusterd process on one SN node may get stuck: when you run a gluster command, it does not respond.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Below are the glusterd log from the hang and the gdb info of the stuck glusterd process:<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:24.999329] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:24.999450] I [rpc-clnt.c:1048:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:24.999597] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a4315121-e127-42ca-9869-0fe451216a80<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:24.999692] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e, host: mn-0.local, port:
0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010624] W [socket.c:593:__socket_rwv] 0-management: readv on <a href="http://192.168.1.14:24007" target="_blank">192.168.1.14:24007</a> failed (No data available)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010774] I [MSGID: 106004] [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer <mn-0.local> (<7f694d7e-7613-4298-8da1-50cbb73ed47e>), in state <Peer in Cluster>,
has disconnected from glusterd.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010860] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
[0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol ccs not held<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010903] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for ccs<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010931] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
[0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol encryptfile not held<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010945] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for encryptfile<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010971] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
[0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol export not held<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.010983] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for export<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011006] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
[0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol log not held<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011046] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for log<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011070] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
[0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol mstate not held<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011082] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for mstate<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011104] W [glusterd-locks.c:846:glusterd_mgmt_v3_unlock] (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4) [0x7f1828aa0ba4] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
[0x7f1828ab3d5d] -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635) [0x7f1828b7c635] ) 0-management: Lock for vol services not held<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011115] W [MSGID: 106118] [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not released for services<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011268] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f182df24a99] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1f9)[0x7f182dce6f27]
(--> /lib64/libgfrpc.so.0<span style="color:red">(saved_frames_destroy</span>+0x1f)[0x7f182dce701a] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7f182dce756e] (--> /lib64/libgfrpc.so.0(+0x12f9e)[0x7f182dce7f9e] ))))) 0-management: forced
unwinding frame type(Peer mgmt) op(--(4)) called at 2019-01-28 14:38:24.999851 (xid=0x7)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011286] E [MSGID: 106158] [glusterd-rpc-ops.c:684:__glusterd_friend_update_cbk] 0-management: RPC Error<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011301] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received RJT from uuid: 00000000-0000-0000-0000-000000000000<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011424] I [MSGID: 106006] [glusterd-svc-mgmt.c:328:glusterd_svc_common_rpc_notify] 0-management: glustershd has connected with glusterd.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011599] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /mnt/bricks/ccs/brick on port 53952<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.011862] I [MSGID: 106492] [glusterd-handler.c:2718:<span style="color:red">__glusterd_handle_friend_update</span>] 0-glusterd: Received friend update from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021058] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021090] I [socket.c:3704:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021099] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021108] E [MSGID: 106430] [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission failed<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021126] E [rpcsvc.c:559:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021135] E [rpcsvc.c:1364:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[2019-01-28 14:38:25.021147] W [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: failed to queue error reply<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">The gdb backtraces of the stuck glusterd threads are:<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">(gdb) thread 8<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[Switching to thread 8 (Thread 0x7f1826ec2700 (LWP 2418))]<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">(gdb) bt<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#1 0x00007f182ccea677 in __lll_lock_elision () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#2 0x00007f182df5cae6 in iobref_unref () from /lib64/libglusterfs.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#3 0x00007f182dce2f29 in rpc_transport_pollin_destroy () from /lib64/libgfrpc.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#4 0x00007f1827ccf319 in socket_event_poll_in () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#5 0x00007f1827ccf932 in socket_event_handler () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#6 0x00007f182df925d4 in event_dispatch_epoll_handler () from /lib64/libglusterfs.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#7 0x00007f182df928ab in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#8 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#9 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">(gdb) thread 9<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">[Switching to thread 9 (Thread 0x7f18266c1700 (LWP 2419))]<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">(gdb) bt<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#0 0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#1 0x00007f182cce2b42 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#2 0x00007f182cce44c8 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#3 0x00007f1827ccadab in socket_event_poll_err () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#4 0x00007f1827ccf99c in socket_event_handler () from /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#5 0x00007f182df925d4 in event_dispatch_epoll_handler () from /lib64/libglusterfs.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#6 0x00007f182df928ab in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#7 0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">#8 0x00007f182c5b4e8f in clone () from /lib64/libc.so.6<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:7.5pt" lang="EN-US">Have you ever encountered this issue? From the gdb info it seems the epoll threads are deadlocked.</span></p></div></div></blockquote><div><br></div><div>It's not a deadlock. POLLERR handling is serialized with any ongoing POLLIN/POLLOUT request processing. The problem here is that iobref_unref is waiting on a lock, and hence the POLLIN processing didn't complete. The question is why iobref_unref is stuck: who holds the iobref's lock?</div><div> <br></div></div></div>