[Bugs] [Bug 1576927] New: Self heal daemon crash triggered by particular host

bugzilla at redhat.com bugzilla at redhat.com
Thu May 10 18:21:06 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1576927

            Bug ID: 1576927
           Summary: Self heal daemon crash triggered by particular host
           Product: GlusterFS
           Version: 4.0
         Component: selfheal
          Assignee: bugs at gluster.org
          Reporter: langlois.tyler at gmail.com
                CC: bugs at gluster.org



Description of problem:

When a particular peer joins my cluster, which manages disperse volumes of type
2 + 1, some kind of bad state on that peer causes the self-heal daemons to
crash on every node in the cluster.
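
(For reference, the commands below are standard gluster CLI checks I would run
to capture peer and heal state before reproducing; this is only a sketch of the
diagnostics, and their output is not attached here.)

# Peer membership as seen from a healthy node
gluster peer status

# Outstanding heal entries on the affected volume
gluster volume heal knox info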

Version-Release number of selected component (if applicable):

[root@codex01 ~]# gluster --version
glusterfs 4.0.1
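
(Since the problem seems tied to one node's state, the cluster op-version may
also be worth recording; the commands below are standard gluster CLI and are
only a sketch of that check, output not attached.)

# Cluster-wide operating version and the maximum version the nodes support
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version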

How reproducible:

With my current cluster state, every time.

Steps to Reproduce:

1. Starting with glusterd off and all volumes stopped, start glusterd on _two_
nodes.

2. State is now:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Self-heal Daemon on localhost               N/A       N/A        Y       16873
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        Y       14051
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
[root@codex01 ~]# gluster v info knox
Volume Name: knox
Type: Disperse
Volume ID: bd295812-4a07-482f-9329-4cafbdf0ad28
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: codex01:/srv/storage/disperse-2_1
Brick2: codex02:/srv/storage/disperse-2_1
Brick3: codex03:/srv/storage/disperse-2_1-fixed
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
transport.address-family: inet
nfs.disable: on


3. Start glusterd on the third, potentially bad node:
ssh root@codex03 systemctl start glusterd

4. This causes the self-heal daemon to crash on all nodes with the following in
/var/log/glusterfs/glustershd.log:
[2018-05-10 18:15:07.811324] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.812089] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.812822] I [rpc-clnt.c:2071:rpc_clnt_reconfig]
0-knox-client-2: changing port to 49152 (from 0)
[2018-05-10 18:15:07.820841] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.821667] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.825170] I [MSGID: 114046]
[client-handshake.c:1176:client_setvolume_cbk] 0-knox-client-2: Connected to
knox-client-2, attached to remote volume '/srv/storage/disperse-2_1-fixed'.
[2018-05-10 18:15:07.825458] W [MSGID: 101088]
[common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the
backtrace.
The message "W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save]
0-knox-disperse-0: Failed to save the backtrace." repeated 50 times between
[2018-05-10 18:15:07.825458] and [2018-05-10 18:15:07.925122]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2018-05-10 18:15:07
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.0.1
---------
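
(Signal 6 is SIGABRT, and gf_backtrace_save is failing, so no backtrace lands
in the log. If a core file is captured, a full backtrace could be pulled with
something like the following; the coredumpctl invocation assumes a
systemd-coredump setup and is only a sketch.)

# List recent glusterfs core dumps and open the newest one in gdb
coredumpctl list glusterfs
coredumpctl gdb glusterfs

# Inside gdb, dump backtraces for every thread
(gdb) thread apply all bt full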

5. New volume status:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1     49152     0          Y       16851
Brick codex02:/srv/storage/disperse-2_1     49152     0          Y       14029
Brick codex03:/srv/storage/disperse-2_1-fix
ed                                          49152     0          Y       29985
Self-heal Daemon on localhost               N/A       N/A        N       N/A
Bitrot Daemon on localhost                  N/A       N/A        Y       16895
Scrubber Daemon on localhost                N/A       N/A        Y       16905
Self-heal Daemon on codex02                 N/A       N/A        N       N/A
Bitrot Daemon on codex02                    N/A       N/A        Y       14060
Scrubber Daemon on codex02                  N/A       N/A        Y       14070
Self-heal Daemon on codex03                 N/A       N/A        N       N/A
Bitrot Daemon on codex03                    N/A       N/A        Y       30033
Scrubber Daemon on codex03                  N/A       N/A        Y       30040

Task Status of Volume knox
------------------------------------------------------------------------------
There are no active volume tasks
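
(For completeness: the usual way to respawn the missing self-heal daemons would
be the command below; I have not attached its output here.)

gluster volume start knox force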


Actual results:

Self-heal daemon crashes

Expected results:

Self-heal daemon shouldn't crash

Additional info:

I understand that this may be hard to reproduce, as it's likely some sort of
bad state that codex03 got into, but I didn't want to blow away the cluster in
case that destroys a state I can't manage to reproduce again. This occurs on
_any_ volume on this cluster - the self-heal daemons run fine until this
particular node joins, at which point it brings down all of them.
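
(As a stopgap, the disperse self-heal daemon could presumably be disabled per
volume with the option already shown in the volume info above; this is only a
workaround sketch, not something I have verified against the crash.)

gluster volume set knox cluster.disperse-self-heal-daemon disable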

I was directed here from IRC, but if this belongs on the mailing list instead,
I'm happy to move it over there.
