[Gluster-users] glusterfsd won't restart on one brick

Joe Julian joe at julianfamily.org
Wed Jul 31 17:29:20 UTC 2013

To kill a zombie process, you have to kill the parent process.

ps -p 23744 -o ppid=

If the result is 1, the parent is init and you are stuck rebooting. Otherwise, kill that parent process.
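
As a rough sketch of that check, the following reads the same information from /proc instead of calling ps. It assumes a Linux host with /proc mounted. The function names (proc_state, proc_ppid, zombie_advice) are made up for illustration and are not part of any gluster tooling.

```shell
#!/bin/sh
# Hypothetical helpers; they assume Linux and a readable /proc/PID/status.

# Print the one-letter state of a process (Z means zombie).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Print the parent PID of a process (same value as `ps -p PID -o ppid=`).
proc_ppid() {
    awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}

# Report what can be done about a process that may be a zombie.
zombie_advice() {
    pid=$1
    if [ ! -r "/proc/$pid/status" ]; then
        echo "no such process: $pid"
        return 1
    fi
    if [ "$(proc_state "$pid")" != "Z" ]; then
        echo "$pid is not a zombie"
        return 0
    fi
    parent=$(proc_ppid "$pid")
    if [ "$parent" = "1" ]; then
        echo "$pid is a zombie whose parent is PID 1; a reboot is the only option"
    else
        echo "$pid is a zombie; kill its parent with: kill $parent"
    fi
}
```

For the process in this thread it would be called as `zombie_advice 23744`. It only prints a suggestion and never sends a signal itself.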

Deleting a filename does not close the named pipe behind it, and that is what caused the failure below.
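
A small demonstration of that behaviour, under a few assumptions: it uses a throwaway FIFO created with mkfifo rather than glusterd's actual socket, it relies on Linux allowing a FIFO to be opened read/write without blocking, and the function name is invented for this example.

```shell
#!/bin/sh
# Hypothetical demo: removing a name does not close an already-open descriptor.
unlink_keeps_fd_open() {
    dir=$(mktemp -d) || return 1
    fifo="$dir/demo.fifo"
    mkfifo "$fifo" || return 1

    # Open the FIFO read/write on descriptor 3 (does not block on Linux).
    exec 3<>"$fifo"

    # Remove the name and its directory; descriptor 3 stays valid.
    rm "$fifo"
    rmdir "$dir"

    # Data still flows through the descriptor after the name is gone.
    echo "still open" >&3
    IFS= read -r line <&3

    # Only closing the descriptor actually releases the pipe.
    exec 3>&-
    echo "$line"
}
```

Calling `unlink_keeps_fd_open` prints the line that was written after the file name had already been removed.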

Joel Young <jdy at cryregarder.com> wrote:
>On Tue, Jul 30, 2013 at 10:49 PM, Kaushal M <kshlmster at gmail.com> wrote:
>> I think I've found the problem. The problem is not with the brick
>> port, but instead with the unix domain socket used for communication
>> between glusterd and
>Makes sense.
>> So this is most likely due to the zombie process 23744 still listening
>> on the unix domain socket. Only one bind can be performed on a unix
>> domain socket. If another bind is tried we get an EADDRINUSE error.
>> Can you kill 23744, remove
>> and restart the brick using 'gluster volume start'. This should allow
>> it to start.
>It isn't possible to kill 23744 as it is a zombie.  fuser on the socket
>doesn't report any users.  I did remove /var/run/5a53...
>"gluster volume start home" doesn't work as the volume is already
>started (and mounted and in use by users so I'd rather not shut down
>the cluster).  I tried a "systemctl restart glusterd.service" which did
>not restart the brick but did leave the following in
>/var/log/bricks/lhome-gluster_home.log:
>[2013-07-31 16:04:59.716771] I [glusterfsd.c:1910:main]
>0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version
>3.4.0 (/usr/sbin/glusterfsd -s ir2 --volfile-id
>home.ir2.lhome-gluster_home -p
>/var/lib/glusterd/vols/home/run/ir2-lhome-gluster_home.pid -S
>/var/run/5a538b707ce5dbf525ba6d01835863bb.socket --brick-name
>/lhome/gluster_home -l
>/var/log/glusterfs/bricks/lhome-gluster_home.log --xlator-option
>--brick-port 49157 --xlator-option home-server.listen-port=49157)
>[2013-07-31 16:04:59.719901] I [socket.c:3480:socket_init]
>0-socket.glusterfsd: SSL support is NOT enabled
>[2013-07-31 16:04:59.719936] I [socket.c:3495:socket_init]
>0-socket.glusterfsd: using system polling thread
>[2013-07-31 16:04:59.720242] I [socket.c:3480:socket_init]
>0-glusterfs: SSL support is NOT enabled
>[2013-07-31 16:04:59.720256] I [socket.c:3495:socket_init]
>0-glusterfs: using system polling thread
>[2013-07-31 16:04:59.752491] I [graph.c:239:gf_add_cmdline_options]
>0-home-server: adding option 'listen-port' for volume 'home-server'
>with value '49157'
>[2013-07-31 16:04:59.752514] I [graph.c:239:gf_add_cmdline_options]
>0-home-posix: adding option 'glusterd-uuid' for volume 'home-posix'
>with value '9d2d74bf-9055-47a6-b3df-8c2057ea1dd9'
>[2013-07-31 16:04:59.753960] W [options.c:848:xl_opt_validate]
>0-home-server: option 'listen-port' is deprecated, preferred is
>'transport.socket.listen-port', continuing with correction
>[2013-07-31 16:04:59.754000] I [socket.c:3480:socket_init]
>0-tcp.home-server: SSL support is NOT enabled
>[2013-07-31 16:04:59.754025] I [socket.c:3495:socket_init]
>0-tcp.home-server: using system polling thread
>[2013-07-31 16:04:59.754075] E [socket.c:695:__socket_server_bind]
>0-tcp.home-server: binding to  failed: Address already in use
>[2013-07-31 16:04:59.754091] E [socket.c:698:__socket_server_bind]
>0-tcp.home-server: Port is already in use
>[2013-07-31 16:04:59.754108] W [rpcsvc.c:1394:rpcsvc_transport_create]
>0-rpc-service: listening on transport failed
>[2013-07-31 16:04:59.754128] W [server.c:1092:init] 0-home-server:
>creation of listener failed
>[2013-07-31 16:04:59.754140] E [xlator.c:390:xlator_init]
>0-home-server: Initialization of volume 'home-server' failed, review
>your volfile again
>[2013-07-31 16:04:59.754151] E [graph.c:292:glusterfs_graph_init]
>0-home-server: initializing translator failed
>[2013-07-31 16:04:59.754162] E [graph.c:479:glusterfs_graph_activate]
>0-graph: init failed
>[2013-07-31 16:04:59.754404] W [glusterfsd.c:1002:cleanup_and_exit]
>(-->/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90) [0x7f5794b5db10]
>(-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x2fd) [0x7f5795216bcd]
>[0x7f5795212603]))) 0-: received signum (0), shutting down
>Which seems like it worked and then tried again and failed?
