[Gluster-users] query about glusterd epoll thread getting stuck

Raghavendra Gowdappa rgowdapp at redhat.com
Tue Jan 29 11:32:12 UTC 2019


On Tue, Jan 29, 2019 at 12:11 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:

> Hi,
>
> We are using glusterfs version 3.12 with 3 bricks. I find that,
> occasionally, after rebooting all 3 SN nodes simultaneously, the glusterd
> process on one SN node gets stuck: when you try to execute a gluster
> command, it does not respond.
>
>
>
> Following are the logs of the stuck glusterd and the gdb info of the
> stuck glusterd process:
>
> [2019-01-28 14:38:24.999329] I [rpc-clnt.c:1048:rpc_clnt_connection_init]
> 0-snapd: setting frame-timeout to 600
>
> [2019-01-28 14:38:24.999450] I [rpc-clnt.c:1048:rpc_clnt_connection_init]
> 0-snapd: setting frame-timeout to 600
>
> [2019-01-28 14:38:24.999597] I [MSGID: 106493]
> [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management:
> Received ACC from uuid: a4315121-e127-42ca-9869-0fe451216a80
>
> [2019-01-28 14:38:24.999692] I [MSGID: 106493]
> [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC
> from uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e, host: mn-0.local, port: 0
>
> [2019-01-28 14:38:25.010624] W [socket.c:593:__socket_rwv] 0-management:
> readv on 192.168.1.14:24007 failed (No data available)
>
> [2019-01-28 14:38:25.010774] I [MSGID: 106004]
> [glusterd-handler.c:6317:__glusterd_peer_rpc_notify] 0-management: Peer
> <mn-0.local> (<7f694d7e-7613-4298-8da1-50cbb73ed47e>), in state <Peer in
> Cluster>, has disconnected from glusterd.
>
> [2019-01-28 14:38:25.010860] W
> [glusterd-locks.c:846:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4)
> [0x7f1828aa0ba4]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
> [0x7f1828ab3d5d]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635)
> [0x7f1828b7c635] ) 0-management: Lock for vol ccs not held
>
> [2019-01-28 14:38:25.010903] W [MSGID: 106118]
> [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for ccs
>
> [2019-01-28 14:38:25.010931] W
> [glusterd-locks.c:846:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4)
> [0x7f1828aa0ba4]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
> [0x7f1828ab3d5d]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635)
> [0x7f1828b7c635] ) 0-management: Lock for vol encryptfile not held
>
> [2019-01-28 14:38:25.010945] W [MSGID: 106118]
> [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for encryptfile
>
> [2019-01-28 14:38:25.010971] W
> [glusterd-locks.c:846:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4)
> [0x7f1828aa0ba4]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
> [0x7f1828ab3d5d]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635)
> [0x7f1828b7c635] ) 0-management: Lock for vol export not held
>
> [2019-01-28 14:38:25.010983] W [MSGID: 106118]
> [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for export
>
> [2019-01-28 14:38:25.011006] W
> [glusterd-locks.c:846:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4)
> [0x7f1828aa0ba4]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
> [0x7f1828ab3d5d]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635)
> [0x7f1828b7c635] ) 0-management: Lock for vol log not held
>
> [2019-01-28 14:38:25.011046] W [MSGID: 106118]
> [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for log
>
> [2019-01-28 14:38:25.011070] W
> [glusterd-locks.c:846:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4)
> [0x7f1828aa0ba4]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
> [0x7f1828ab3d5d]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635)
> [0x7f1828b7c635] ) 0-management: Lock for vol mstate not held
>
> [2019-01-28 14:38:25.011082] W [MSGID: 106118]
> [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for mstate
>
> [2019-01-28 14:38:25.011104] W
> [glusterd-locks.c:846:glusterd_mgmt_v3_unlock]
> (-->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x1bba4)
> [0x7f1828aa0ba4]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0x2ed5d)
> [0x7f1828ab3d5d]
> -->/usr/lib64/glusterfs/3.12.3/xlator/mgmt/glusterd.so(+0xf7635)
> [0x7f1828b7c635] ) 0-management: Lock for vol services not held
>
> [2019-01-28 14:38:25.011115] W [MSGID: 106118]
> [glusterd-handler.c:6342:__glusterd_peer_rpc_notify] 0-management: Lock not
> released for services
>
> [2019-01-28 14:38:25.011268] E [rpc-clnt.c:350:saved_frames_unwind] (-->
> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e9)[0x7f182df24a99] (-->
> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1f9)[0x7f182dce6f27] (-->
> /lib64/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7f182dce701a] (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7f182dce756e]
> (--> /lib64/libgfrpc.so.0(+0x12f9e)[0x7f182dce7f9e] ))))) 0-management:
> forced unwinding frame type(Peer mgmt) op(--(4)) called at 2019-01-28
> 14:38:24.999851 (xid=0x7)
>
> [2019-01-28 14:38:25.011286] E [MSGID: 106158]
> [glusterd-rpc-ops.c:684:__glusterd_friend_update_cbk] 0-management: RPC
> Error
>
> [2019-01-28 14:38:25.011301] I [MSGID: 106493]
> [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management:
> Received RJT from uuid: 00000000-0000-0000-0000-000000000000
>
> [2019-01-28 14:38:25.011424] I [MSGID: 106006]
> [glusterd-svc-mgmt.c:328:glusterd_svc_common_rpc_notify] 0-management:
> glustershd has connected with glusterd.
>
> [2019-01-28 14:38:25.011599] I [MSGID: 106143]
> [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
> /mnt/bricks/ccs/brick on port 53952
>
> [2019-01-28 14:38:25.011862] I [MSGID: 106492] [glusterd-handler.c:2718:
> __glusterd_handle_friend_update] 0-glusterd: Received friend update from
> uuid: 7f694d7e-7613-4298-8da1-50cbb73ed47e
>
> [2019-01-28 14:38:25.021058] I [MSGID: 106502]
> [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management:
> Received my uuid as Friend
>
> [2019-01-28 14:38:25.021090] I [socket.c:3704:socket_submit_reply]
> 0-socket.management: not connected (priv->connected = -1)
>
> [2019-01-28 14:38:25.021099] E [rpcsvc.c:1364:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc
> peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management)
>
> [2019-01-28 14:38:25.021108] E [MSGID: 106430]
> [glusterd-utils.c:568:glusterd_submit_reply] 0-glusterd: Reply submission
> failed
>
> [2019-01-28 14:38:25.021126] E [rpcsvc.c:559:rpcsvc_check_and_reply_error]
> 0-rpcsvc: rpc actor failed to complete successfully
>
> [2019-01-28 14:38:25.021135] E [rpcsvc.c:1364:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0xa, Program: GlusterD svc
> peer, ProgVers: 2, Proc: 4) to rpc-transport (socket.management)
>
> [2019-01-28 14:38:25.021147] W [rpcsvc.c:565:rpcsvc_check_and_reply_error]
> 0-rpcsvc: failed to queue error reply
>
>
>
>
>
> The gdb info of the stuck glusterd threads is:
>
> (gdb) thread 8
>
> [Switching to thread 8 (Thread 0x7f1826ec2700 (LWP 2418))]
>
> #0  0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> (gdb) bt
>
> #0  0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1  0x00007f182ccea677 in __lll_lock_elision () from /lib64/libpthread.so.0
>
> #2  0x00007f182df5cae6 in iobref_unref () from /lib64/libglusterfs.so.0
>
> #3  0x00007f182dce2f29 in rpc_transport_pollin_destroy () from
> /lib64/libgfrpc.so.0
>
> #4  0x00007f1827ccf319 in socket_event_poll_in () from
> /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
>
> #5  0x00007f1827ccf932 in socket_event_handler () from
> /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
>
> #6  0x00007f182df925d4 in event_dispatch_epoll_handler () from
> /lib64/libglusterfs.so.0
>
> #7  0x00007f182df928ab in event_dispatch_epoll_worker () from
> /lib64/libglusterfs.so.0
>
> #8  0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0
>
> #9  0x00007f182c5b4e8f in clone () from /lib64/libc.so.6
>
> (gdb) thread 9
>
> [Switching to thread 9 (Thread 0x7f18266c1700 (LWP 2419))]
>
> #0  0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> (gdb) bt
>
> #0  0x00007f182cce787c in __lll_lock_wait () from /lib64/libpthread.so.0
>
> #1  0x00007f182cce2b42 in __pthread_mutex_cond_lock () from
> /lib64/libpthread.so.0
>
> #2  0x00007f182cce44c8 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
>
> #3  0x00007f1827ccadab in socket_event_poll_err () from
> /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
>
> #4  0x00007f1827ccf99c in socket_event_handler () from
> /usr/lib64/glusterfs/3.12.3/rpc-transport/socket.so
>
> #5  0x00007f182df925d4 in event_dispatch_epoll_handler () from
> /lib64/libglusterfs.so.0
>
> #6  0x00007f182df928ab in event_dispatch_epoll_worker () from
> /lib64/libglusterfs.so.0
>
> #7  0x00007f182ccde5da in start_thread () from /lib64/libpthread.so.0
>
> #8  0x00007f182c5b4e8f in clone () from /lib64/libc.so.6
>
>
>
>
>
> Have you ever encountered this issue? From the gdb info it seems the
> epoll threads have hit a deadlock.
>

It's not a deadlock. Pollerr is serialized with any ongoing pollin/pollout
request processing. The problem here is that iobref_unref is waiting on a
lock, and hence pollin didn't complete. The question is why iobref_unref is
stuck. Who holds the lock on the iobref?
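
To illustrate the blocking point, here is a minimal sketch of a
mutex-guarded refcount (the struct and function names are made up for
illustration; this is not the actual GlusterFS source):

    /* sketch.c -- why an unref call can sit in __lll_lock_wait.
     * Illustration only, not the GlusterFS source. */
    #include <pthread.h>
    #include <stdlib.h>

    struct iobref_sketch {
        pthread_mutex_t lock;   /* guards the refcount */
        int             ref;
    };

    static void iobref_unref_sketch(struct iobref_sketch *iobref)
    {
        int ref;

        /* If another thread holds `lock` and never releases it, this
         * call never returns, the pollin handler never completes, and
         * the epoll worker looks hung -- matching frames #0-#3 of
         * thread 8 above. */
        pthread_mutex_lock(&iobref->lock);
        ref = --iobref->ref;
        pthread_mutex_unlock(&iobref->lock);

        if (ref == 0) {
            pthread_mutex_destroy(&iobref->lock);
            free(iobref);
        }
    }

To find the holder: assuming the lock is a glibc pthread mutex (the
__lll_lock_wait frame suggests it is) and debug symbols are installed, a
locked mutex normally records the owner's LWP id in __data.__owner, so
something along these lines should identify the owning thread:

    (gdb) thread 8
    (gdb) frame 2                                # iobref_unref
    (gdb) print iobref->lock.__data.__owner     # LWP id of the holder
    (gdb) thread apply all bt                    # find that LWP's stack

If no live thread has that LWP id, the holder exited (or the iobref was
freed) without unlocking, which would also produce this hang.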