[Bugs] [Bug 1506487] New: glusterfs / glusterfsd processes may not start properly upon reboot/restart
bugzilla@redhat.com
Thu Oct 26 08:11:25 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1506487
Bug ID: 1506487
Summary: glusterfs / glusterfsd processes may not start
properly upon reboot/restart
Product: GlusterFS
Version: 3.12
Component: glusterd
Assignee: bugs@gluster.org
Reporter: srn@nexatech.fr
CC: bugs@gluster.org
Description of problem:
I have a fresh cluster of 30 nodes (Gluster 3.12.1 / CentOS 7.1 / VM), each
having two bricks (/mnt/dd1/brick & /mnt/dd2/brick), with a distributed-replicated
volume (I reproduced the bug with both replica 2 and replica 3).
When a node reboots, or when its gluster processes are restarted, some of those
processes (glusterfs/glusterfsd) may die seconds after launch.
How reproducible:
Happens around 3 out of 4 times in my setup.
Steps to Reproduce:
1. gluster volume create mysto replica 3 ... (see details for full command)
2. gluster volume start mysto
3. either reboot one of the nodes, or restart its gluster processes:
3.1 reboot
3.2 /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh; systemctl
restart glusterd
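To check whether the processes survived after step 3, a minimal sketch like the
following works (the expected counts assume this setup's two bricks per node;
the 90-second sleep is an arbitrary margin, not a measured value):

# wait for the failure window to pass, then count gluster processes
sleep 90
echo "glusterd:   $(pgrep -xc glusterd)"    # expect 1
echo "glusterfsd: $(pgrep -xc glusterfsd)"  # expect 2 (one per brick)
echo "glusterfs:  $(pgrep -xc glusterfs)"   # expect 1 (self-heal daemon)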
Actual results:
glusterd, glusterfs and two glusterfsd processes start, as expected, and then
the glusterfs/glusterfsd processes stop randomly (one of them, two of them, or
all of them) less than a minute later.
In the logs, the dying processes always seem to have trouble communicating
with the main gluster daemon, which leads me to think that the problem is on
the glusterd side, serving some requests but not others:
W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.45.4.2:24007 failed
(Connection reset by peer)
or
E [socket.c:2369:socket_connect_finish] 0-glusterfs: connection to
10.45.4.2:24007 failed (Connection timed out); disconnecting socket
10.45.4.2 being the IP of the current node.
See details for full log.
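To quickly spot which processes were hit, grepping the default log locations
for these two patterns works (a sketch; the paths match the defaults shown in
the details below):

# management-daemon disconnects reported by the dying processes
grep -rn "Exhausted all volfile servers" /var/log/glusterfs/
grep -rn "readv on .*:24007 failed" /var/log/glusterfs/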
Expected results:
I expect the node to properly start all of its processes and then be ready to
serve clients.
Additional info:
* the glusterd daemon always survives;
* if I then run, from any node, "gluster volume start mysto force", the
missing processes are properly launched and survive;
* I originally hit the problem while doing brick-healing tests, but I just
reproduced it without writing anything to a newly created volume (it has not
even been mounted once);
* I tried increasing server.event-threads to 5 and even 50 (just to be sure);
it didn't change anything (exact commands below);
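For reference, the exact commands behind the last two bullets (standard
gluster CLI; 50 is just an intentionally oversized value):

gluster volume start mysto force                    # relaunches missing brick processes
gluster volume set mysto server.event-threads 50    # tried 5 and 50; no effect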
Details:
[root@rk1glust1 ~]# rpm -qa | grep gluster
glusterfs-client-xlators-3.12.1-2.el7.x86_64
glusterfs-api-3.12.1-2.el7.x86_64
glusterfs-libs-3.12.1-2.el7.x86_64
glusterfs-cli-3.12.1-2.el7.x86_64
glusterfs-3.12.1-2.el7.x86_64
glusterfs-fuse-3.12.1-2.el7.x86_64
glusterfs-coreutils-0.2.0-1.el7.x86_64
glusterfs-server-3.12.1-2.el7.x86_64
[root@rk1glust1 ~]# gluster volume create mysto replica 3
10.45.1.1:/mnt/dd1/brick 10.45.1.2:/mnt/dd1/brick 10.45.1.3:/mnt/dd1/brick
10.45.2.1:/mnt/dd1/brick 10.45.2.2:/mnt/dd1/brick 10.45.2.3:/mnt/dd1/brick
10.45.4.1:/mnt/dd1/brick 10.45.4.2:/mnt/dd1/brick 10.45.4.3:/mnt/dd1/brick
10.45.6.1:/mnt/dd1/brick 10.45.6.2:/mnt/dd1/brick 10.45.6.3:/mnt/dd1/brick
10.45.8.1:/mnt/dd1/brick 10.45.8.2:/mnt/dd1/brick 10.45.8.3:/mnt/dd1/brick
10.45.9.1:/mnt/dd1/brick 10.45.9.2:/mnt/dd1/brick 10.45.9.3:/mnt/dd1/brick
10.45.10.1:/mnt/dd1/brick 10.45.10.2:/mnt/dd1/brick 10.45.10.3:/mnt/dd1/brick
10.45.12.1:/mnt/dd1/brick 10.45.12.2:/mnt/dd1/brick 10.45.12.3:/mnt/dd1/brick
10.45.13.1:/mnt/dd1/brick 10.45.13.2:/mnt/dd1/brick 10.45.13.3:/mnt/dd1/brick
10.45.14.1:/mnt/dd1/brick 10.45.14.2:/mnt/dd1/brick 10.45.14.3:/mnt/dd1/brick
10.45.1.1:/mnt/dd2/brick 10.45.1.2:/mnt/dd2/brick 10.45.1.3:/mnt/dd2/brick
10.45.2.1:/mnt/dd2/brick 10.45.2.2:/mnt/dd2/brick 10.45.2.3:/mnt/dd2/brick
10.45.4.1:/mnt/dd2/brick 10.45.4.2:/mnt/dd2/brick 10.45.4.3:/mnt/dd2/brick
10.45.6.1:/mnt/dd2/brick 10.45.6.2:/mnt/dd2/brick 10.45.6.3:/mnt/dd2/brick
10.45.8.1:/mnt/dd2/brick 10.45.8.2:/mnt/dd2/brick 10.45.8.3:/mnt/dd2/brick
10.45.9.1:/mnt/dd2/brick 10.45.9.2:/mnt/dd2/brick 10.45.9.3:/mnt/dd2/brick
10.45.10.1:/mnt/dd2/brick 10.45.10.2:/mnt/dd2/brick 10.45.10.3:/mnt/dd2/brick
10.45.12.1:/mnt/dd2/brick 10.45.12.2:/mnt/dd2/brick 10.45.12.3:/mnt/dd2/brick
10.45.13.1:/mnt/dd2/brick 10.45.13.2:/mnt/dd2/brick 10.45.13.3:/mnt/dd2/brick
10.45.14.1:/mnt/dd2/brick 10.45.14.2:/mnt/dd2/brick 10.45.14.3:/mnt/dd2/brick
[root@rk1glust1 ~]# gluster volume start mysto
[root@rk4glust2 ~]# reboot
...
Example of ps output; here only one glusterfsd (the one handling
/mnt/dd1/brick) died:
[root@rk4glust2 ~]# ps auxf | grep [g]luster
root 932 1.1 0.1 667772 14560 ? Ssl 07:41 0:12
/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root 944 0.2 0.5 1136964 41380 ? Ssl 07:41 0:02
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/eefb7938899437a1c25e42edaef663ce.socket --xlator-option
*replicate*.node-uuid=d1dfb9d4-6c05-45be-b0eb-13298c0cf367
root 961 0.0 0.2 1032788 16956 ? Ssl 07:41 0:00
/usr/sbin/glusterfsd -s 10.45.4.2 --volfile-id mysto.10.45.4.2.mnt-dd2-brick -p
/var/run/gluster/vols/mysto/10.45.4.2-mnt-dd2-brick.pid -S
/var/run/gluster/c5c5852563f2288eef42f52e6161b23b.socket --brick-name
/mnt/dd2/brick -l /var/log/glusterfs/bricks/mnt-dd2-brick.log --xlator-option
*-posix.glusterd-uuid=d1dfb9d4-6c05-45be-b0eb-13298c0cf367 --brick-port 49153
--xlator-option mysto-server.listen-port=49153
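A quicker way to confirm which brick process is missing, without parsing ps
(standard CLI; the grep is only there to narrow the output to this node):

gluster volume status mysto | grep '10.45.4.2'
# an offline brick shows N in the Online column and N/A as its TCP port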
Example of logs when a process fails:
[root@rk4glust2 ~]# cat /var/log/glusterfs/bricks/mnt-dd1-brick.log
[2017-10-25 12:16:09.947723] I [MSGID: 100030] [glusterfsd.c:2496:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.12.1
(args: /usr/sbin/glusterfsd -s 10.45.4.2 --volfile-id
mysto.10.45.4.2.mnt-dd1-brick -p
/var/run/gluster/vols/mysto/10.45.4.2-mnt-dd1-brick.pid -S
/var/run/gluster/53ec230bdae6d472b0240b5b54a49ed7.socket --brick-name
/mnt/dd1/brick -l /var/log/glusterfs/bricks/mnt-dd1-brick.log --xlator-option
*-posix.glusterd-uuid=d1dfb9d4-6c05-45be-b0eb-13298c0cf367 --brick-port 49152
--xlator-option mysto-server.listen-port=49152)
[2017-10-25 12:16:09.953665] W [MSGID: 101002] [options.c:995:xl_opt_validate]
0-glusterfs: option 'address-family' is deprecated, preferred is
'transport.address-family', continuing with correction
[2017-10-25 12:16:09.957247] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2017-10-25 12:16:47.294272] W [socket.c:593:__socket_rwv] 0-glusterfs: readv
on 10.45.4.2:24007 failed (Connection reset by peer)
[2017-10-25 12:16:47.294441] I [glusterfsd-mgmt.c:2208:mgmt_rpc_notify]
0-glusterfsd-mgmt: disconnected from remote-host: 10.45.4.2
[2017-10-25 12:16:47.294454] I [glusterfsd-mgmt.c:2229:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-10-25 12:16:47.294745] W [glusterfsd.c:1347:cleanup_and_exit]
(-->/lib64/libgfrpc.so.0(rpc_clnt_notify+0xab) [0x7f0d7722700b]
-->/usr/sbin/glusterfsd(+0x1126d) [0x7f0d7796726d]
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f0d7796018b] ) 0-: received
signum (1), shutting down
[2017-10-25 12:16:47.294831] I [socket.c:3598:socket_submit_request]
0-glusterfs: not connected (priv->connected = -1)
[2017-10-25 12:16:47.294842] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs:
failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1,
Proc: 5) to rpc-transport (glusterfs)
Another run of the same scenario; this time the connection timed out instead
of being reset:
[root@rk4glust2 ~]# cat /var/log/glusterfs/bricks/mnt-dd1-brick.log
[2017-10-26 07:41:16.030488] I [MSGID: 100030] [glusterfsd.c:2496:main]
0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.12.1
(args: /usr/sbin/glusterfsd -s 10.45.4.2 --volfile-id
mysto.10.45.4.2.mnt-dd1-brick -p
/var/run/gluster/vols/mysto/10.45.4.2-mnt-dd1-brick.pid -S
/var/run/gluster/53ec230bdae6d472b0240b5b54a49ed7.socket --brick-name
/mnt/dd1/brick -l /var/log/glusterfs/bricks/mnt-dd1-brick.log --xlator-option
*-posix.glusterd-uuid=d1dfb9d4-6c05-45be-b0eb-13298c0cf367 --brick-port 49152
--xlator-option mysto-server.listen-port=49152)
[2017-10-26 07:41:16.038526] W [MSGID: 101002] [options.c:995:xl_opt_validate]
0-glusterfs: option 'address-family' is deprecated, preferred is
'transport.address-family', continuing with correction
[2017-10-26 07:41:16.043011] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2017-10-26 07:43:23.262749] E [socket.c:2369:socket_connect_finish]
0-glusterfs: connection to 10.45.4.2:24007 failed (Connection timed out);
disconnecting socket
[2017-10-26 07:43:23.262951] I [glusterfsd-mgmt.c:2208:mgmt_rpc_notify]
0-glusterfsd-mgmt: disconnected from remote-host: 10.45.4.2
[2017-10-26 07:43:23.262995] I [glusterfsd-mgmt.c:2229:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-10-26 07:43:23.263599] W [glusterfsd.c:1347:cleanup_and_exit]
(-->/lib64/libgfrpc.so.0(rpc_clnt_notify+0xab) [0x7fd1d735a00b]
-->/usr/sbin/glusterfsd(+0x1126d) [0x7fd1d7a9a26d]
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7fd1d7a9318b] ) 0-: received
signum (1), shutting down
[2017-10-26 07:43:23.270630] W [rpc-clnt.c:1679:rpc_clnt_submit] 0-glusterfs:
failed to submit rpc-request (XID: 0x2 Program: Gluster Portmap, ProgVers: 1,
Proc: 5) to rpc-transport (glusterfs)
[2017-10-26 07:43:23.271359] I [timer.c:198:gf_timer_registry_init]
(-->/lib64/libglusterfs.so.0(+0x39132) [0x7fd1d759f132]
-->/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xee) [0x7fd1d735954e]
-->/lib64/libglusterfs.so.0(gf_timer_call_after+0x213) [0x7fd1d759f4f3] )
0-timer: ctx cleanup started
[2017-10-26 07:43:23.271498] E [timer.c:44:gf_timer_call_after]
(-->/lib64/libglusterfs.so.0(+0x39132) [0x7fd1d759f132]
-->/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xee) [0x7fd1d735954e]
-->/lib64/libglusterfs.so.0(gf_timer_call_after+0x259) [0x7fd1d759f539] )
0-timer: !reg
[2017-10-26 07:43:23.271520] E [rpc-clnt.c:417:rpc_clnt_reconnect] 0-glusterfs:
Error adding to timer event queue
[root@rk4glust2 ~]# cat /var/log/glusterfs/glustershd.log
[2017-10-25 12:11:30.950132] I [MSGID: 100030] [glusterfsd.c:2496:main]
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.12.1
(args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/eefb7938899437a1c25e42edaef663ce.socket --xlator-option
*replicate*.node-uuid=d1dfb9d4-6c05-45be-b0eb-13298c0cf367)
[2017-10-25 12:11:30.960965] W [MSGID: 101002] [options.c:995:xl_opt_validate]
0-glusterfs: option 'address-family' is deprecated, preferred is
'transport.address-family', continuing with correction
[2017-10-25 12:11:30.966860] I [MSGID: 101190]
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with
index 1
[2017-10-25 12:12:06.268304] W [socket.c:593:__socket_rwv] 0-glusterfs: readv
on 127.0.0.1:24007 failed (Connection reset by peer)
[2017-10-25 12:12:06.268472] I [glusterfsd-mgmt.c:2208:mgmt_rpc_notify]
0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2017-10-25 12:12:06.268501] I [glusterfsd-mgmt.c:2229:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-10-25 12:12:06.268999] W [glusterfsd.c:1347:cleanup_and_exit]
(-->/lib64/libgfrpc.so.0(rpc_clnt_notify+0xab) [0x7f213832300b]
-->/usr/sbin/glusterfs(+0x1126d) [0x7f2138a6326d]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f2138a5c18b] ) 0-: received
signum (1), shutting down