[Gluster-devel] sometimes when reboot node glustershd process does not come up.
Atin Mukherjee
amukherj at redhat.com
Wed Feb 7 15:48:43 UTC 2018
Apologies for coming back late on this; I was occupied with some other
priority work. I think your analysis is spot on.
To add on why we hit this race: when glusterd restarts, glusterd_svcs_manager()
is invoked from glusterd_compare_friend_data() as well as through
glusterd_restart_bricks(), and in the latter case the function is invoked
through a synctask. That means multiple threads can end up executing
glusterd_restart_bricks() at the same time, and with that parallelism it is
quite possible that one thread tries to kill the process while another
immediately tries to bring it up. Since it is the glusterfsd process that
cleans up the pidfile, this race gets exposed.
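To make the failure mode concrete, here is a minimal, self-contained toy model
(hypothetical names such as svc_stop()/svc_start(), not the actual glusterd
code) of one stop-then-start cycle in which both decisions are keyed off
pidfile existence, while the pidfile itself is only ever removed by the daemon
on a clean exit:

/* toy_shd_race.c -- toy model only, NOT glusterd code.
 * Build: gcc toy_shd_race.c -o toy_shd_race */
#include <stdbool.h>
#include <stdio.h>

static bool pidfile_exists = true;   /* written by the daemon when it starts */
static bool daemon_alive   = true;

static void svc_stop(void)
{
    printf("stop : SIGTERM sent, daemon begins its cleanup\n");
    daemon_alive = false;
    /* We re-check before the daemon has unlinked its own pidfile ... */
    if (pidfile_exists) {
        printf("stop : pidfile still present -> escalate to SIGKILL\n");
        /* ... and after SIGKILL it never will unlink it: the file goes stale. */
    }
}

static void svc_start(void)
{
    if (pidfile_exists) {            /* a stale file looks like a live service */
        printf("start: pidfile present -> \"already running\", skipping start\n");
        return;
    }
    daemon_alive   = true;
    pidfile_exists = true;
    printf("start: daemon started\n");
}

int main(void)
{
    /* With the friend-handshake path and the glusterd_restart_bricks()
     * synctask both driving the service manager, this stop/start cycle can
     * run back-to-back in two threads; one pass is enough to show the bad
     * end state the logs report. */
    svc_stop();
    svc_start();
    printf("end  : daemon_alive=%d, stale pidfile=%d\n",
           daemon_alive, pidfile_exists);
    return 0;
}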
The good news is that I have a patch (mainly for one of the brick
multiplexing bugs), https://review.gluster.org/#/c/19357, which actually
addresses this problem. I would appreciate it if you could take a look at it,
apply the change, and retest.
On Wed, Jan 24, 2018 at 8:39 AM, Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:
> Hi,
> While running glusterfs tests recently, we occasionally hit the following
> issue: sometimes after rebooting the sn node, the glustershd process does
> not come up.
>
> Status of volume: export
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sn-0.local:/mnt/bricks/export/brick   49154     0          Y       1157
> Brick sn-1.local:/mnt/bricks/export/brick   49152     0          Y       1123
> Self-heal Daemon on localhost               N/A       N/A        Y       1642
> *Self-heal Daemon on sn-2.local             N/A       N/A        N       N/A*
> Self-heal Daemon on sn-1.local              N/A       N/A        Y       1877
>
> Task Status of Volume export
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> [root@sn-0:/root]
>
> When I check the glusterd.log on sn-2, I find:
> [2018-01-23 10:11:29.534588] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
> [2018-01-23 10:11:29.542764] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sn-1.local (0), ret: 0, op_ret: 0
> [2018-01-23 10:11:29.561677] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
> [2018-01-23 10:11:29.561719] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
> [2018-01-23 10:11:29.570655] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
> [2018-01-23 10:11:29.570686] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
> [2018-01-23 10:11:29.579589] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad, host: sn-1.local, port: 0
> [2018-01-23 10:11:29.588190] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
> [2018-01-23 10:11:29.588258] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
> [2018-01-23 10:11:29.588280] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped
> [2018-01-23 10:11:29.588291] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
> [2018-01-23 10:11:30.091487] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 1101
> [2018-01-23 10:11:30.092964] W [socket.c:593:__socket_rwv] 0-glustershd: readv on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket failed (No data available)
> [2018-01-23 10:11:31.091657] I [MSGID: 106568] [glusterd-proc-mgmt.c:110:glusterd_proc_stop] 0-management: second time kill 1101   //this log is added by me
> [2018-01-23 10:11:31.091711] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped
> [2018-01-23 10:11:31.097696] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
> [2018-01-23 10:11:31.591881] I [MSGID: 106567] [glusterd-svc-mgmt.c:162:glusterd_svc_start] 0-management: found svc glustershd is already running
> [2018-01-23 10:11:35.592332] W [socket.c:3314:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket, (No such file or directory)
> [2018-01-23 10:11:35.592757] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
> [2018-01-23 10:11:35.592818] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped
> [2018-01-23 10:11:35.593163] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
> [2018-01-23 10:11:35.593189] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped
>
> From glustershd.log
> [2018-01-23 10:11:29.535459] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mstate-client-1: disconnected from mstate-client-1. Client process will keep trying to connect to glusterd until brick's port is available
> [2018-01-23 10:11:29.535463] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-services-client-1: disconnected from services-client-1. Client process will keep trying to connect to glusterd until brick's port is available
> [2018-01-23 10:11:30.091763] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x74a5) [0x7f318d7364a5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdc) [0x40a513] -->/usr/sbin/glusterfs(cleanup_and_exit+0x77) [0x4089ec] ) 0-: received signum (15), shutting down
>
> So, the story seems to go like this:
>
>
> 1. glusterd stops glustershd by sending SIGTERM.
> 2. This does not take effect right away: from glusterd's side the glustershd
> pid file is still there, so it thinks glustershd is still alive and tries
> again with SIGKILL.
> 3. glusterd then tries to start glustershd, but it finds the glustershd pid
> file still exists, so it thinks there is no need to start the glustershd
> process again.
> 4. We end up with no glustershd process (a daemon-side sketch of why the
> pid file survives follows below).
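> To illustrate why the pid file goes stale, here is a generic daemon sketch
> (hypothetical pidfile path, not glusterfsd itself): the process can only
> unlink its own pid file while handling SIGTERM; SIGKILL cannot be caught,
> so the file is left behind.
>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define PIDFILE "/tmp/toy-daemon.pid"   /* hypothetical path */
>
> static volatile sig_atomic_t stop;
>
> static void on_sigterm(int sig)
> {
>     (void)sig;
>     stop = 1;
> }
>
> int main(void)
> {
>     FILE *f = fopen(PIDFILE, "w");      /* daemon writes its own pid file */
>     if (!f)
>         return EXIT_FAILURE;
>     fprintf(f, "%d\n", (int)getpid());
>     fclose(f);
>
>     signal(SIGTERM, on_sigterm);
>
>     while (!stop)                       /* pretend to do self-heal work */
>         pause();
>
>     unlink(PIDFILE);   /* reached on SIGTERM cleanup, never after SIGKILL */
>     return EXIT_SUCCESS;
> }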
>
>
> So should we also remove the pid file when we send SIGKILL? Or should we
> remove the pid file in glusterd_svc_stop() when glusterd_proc_stop() returns 0?
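> For example, something along these lines (a rough, hypothetical sketch of
> the idea, not the real glusterd_proc_stop()): once we have had to fall back
> to SIGKILL, the daemon can no longer unlink its own pid file, so remove the
> stale file on its behalf.
>
> #include <signal.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static int proc_stop_and_cleanup(pid_t pid, const char *pidfile)
> {
>     if (kill(pid, SIGTERM) != 0)
>         return -1;
>
>     sleep(1);                       /* grace period for a clean exit */
>
>     if (kill(pid, 0) == 0) {        /* still around -> escalate */
>         kill(pid, SIGKILL);
>         unlink(pidfile);            /* the daemon cannot do this any more */
>     }
>     return 0;
> }
>
> int main(void)
> {
>     /* demo: a child that ignores SIGTERM, so we are forced into SIGKILL */
>     pid_t pid = fork();
>     if (pid == 0) {
>         signal(SIGTERM, SIG_IGN);
>         for (;;)
>             pause();
>     }
>
>     FILE *f = fopen("/tmp/toy.pid", "w");   /* hypothetical pid file */
>     if (f) { fprintf(f, "%d\n", (int)pid); fclose(f); }
>
>     proc_stop_and_cleanup(pid, "/tmp/toy.pid");
>     printf("stale pid file removed on the daemon's behalf\n");
>     return 0;
> }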
> If I am not making this clear, you are welcome to discuss the issue with me;
> looking forward to your reply!
> Best regards,
> Cynthia (周琳)
> MBB SM HETRAN SW3 MATRIX
> Storage
> Mobile: +86 (0)18657188311