[Bugs] [Bug 1576927] New: Self heal daemon crash triggered by particular host
bugzilla at redhat.com
Thu May 10 18:21:06 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1576927
Bug ID: 1576927
Summary: Self heal daemon crash triggered by particular host
Product: GlusterFS
Version: 4.0
Component: selfheal
Assignee: bugs at gluster.org
Reporter: langlois.tyler at gmail.com
CC: bugs at gluster.org
Description of problem:
When a particular peer joins my cluster, which manages 2 + 1 disperse volumes,
some kind of bad state on that peer causes the self-heal daemons to crash on
every node in the cluster.
Version-Release number of selected component (if applicable):
[root@codex01 ~]# gluster --version
glusterfs 4.0.1
How reproducible:
With my current cluster state, every time.
Steps to Reproduce:
1. Starting with glusterd off and all volumes stopped, start glusterd on _two_
nodes.
2. State is now:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1 49152 0 Y 16851
Brick codex02:/srv/storage/disperse-2_1 49152 0 Y 14029
Self-heal Daemon on localhost N/A N/A Y 16873
Bitrot Daemon on localhost N/A N/A Y 16895
Scrubber Daemon on localhost N/A N/A Y 16905
Self-heal Daemon on codex02 N/A N/A Y 14051
Bitrot Daemon on codex02 N/A N/A Y 14060
Scrubber Daemon on codex02 N/A N/A Y 14070
[root@codex01 ~]# gluster v info knox
Volume Name: knox
Type: Disperse
Volume ID: bd295812-4a07-482f-9329-4cafbdf0ad28
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: codex01:/srv/storage/disperse-2_1
Brick2: codex02:/srv/storage/disperse-2_1
Brick3: codex03:/srv/storage/disperse-2_1-fixed
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
features.scrub: Active
features.bitrot: on
transport.address-family: inet
nfs.disable: on
3. Start glusterd on the third, potentially bad node:
ssh root@codex03 systemctl start glusterd
4. This causes the self-heal daemon to crash on all nodes with the following in
/var/log/glusterfs/glustershd.log:
[2018-05-10 18:15:07.811324] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.812089] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.812822] I [rpc-clnt.c:2071:rpc_clnt_reconfig]
0-knox-client-2: changing port to 49152 (from 0)
[2018-05-10 18:15:07.820841] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.821667] W [rpc-clnt.c:1739:rpc_clnt_submit]
0-knox-client-2: error returned while attempting to connect to host:(null),
port:0
[2018-05-10 18:15:07.825170] I [MSGID: 114046]
[client-handshake.c:1176:client_setvolume_cbk] 0-knox-client-2: Connected to
knox-client-2, attached to remote volume '/srv/storage/disperse-2_1-fixed'.
[2018-05-10 18:15:07.825458] W [MSGID: 101088]
[common-utils.c:4168:gf_backtrace_save] 0-knox-disperse-0: Failed to save the
backtrace.
The message "W [MSGID: 101088] [common-utils.c:4168:gf_backtrace_save]
0-knox-disperse-0: Failed to save the backtrace." repeated 50 times between
[2018-05-10 18:15:07.825458] and [2018-05-10 18:15:07.925122]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2018-05-10 18:15:07
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.0.1
---------
5. New volume status:
[root@codex01 ~]# gluster v status knox
Status of volume: knox
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick codex01:/srv/storage/disperse-2_1 49152 0 Y 16851
Brick codex02:/srv/storage/disperse-2_1 49152 0 Y 14029
Brick codex03:/srv/storage/disperse-2_1-fixed 49152 0 Y 29985
Self-heal Daemon on localhost N/A N/A N N/A
Bitrot Daemon on localhost N/A N/A Y 16895
Scrubber Daemon on localhost N/A N/A Y 16905
Self-heal Daemon on codex02 N/A N/A N N/A
Bitrot Daemon on codex02 N/A N/A Y 14060
Scrubber Daemon on codex02 N/A N/A Y 14070
Self-heal Daemon on codex03 N/A N/A N N/A
Bitrot Daemon on codex03 N/A N/A Y 30033
Scrubber Daemon on codex03 N/A N/A Y 30040
Task Status of Volume knox
------------------------------------------------------------------------------
There are no active volume tasks
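Triage note: the "signal received: 6" in the dump above is SIGABRT, i.e. the
daemon aborted itself (typically via an assertion or a glibc abort) rather than
segfaulting. The signal name can be confirmed with the shell's kill builtin:

```shell
# Map signal number 6 to its name; 6 is SIGABRT, which is what glustershd received
kill -l 6
```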
Actual results:
Self-heal daemon crashes
Expected results:
Self-heal daemon shouldn't crash
Additional info:
I understand this may be hard to reproduce, since it is likely caused by some
bad state codex03 got into, but I didn't want to blow away the cluster in case
that destroyed a state I couldn't manage to reproduce again. This occurs with
_any_ volume - the self-heal daemons run fine until this particular node joins,
at which point they all crash.
I was directed here from IRC, but if this belongs more correctly in the mailing
list, I'm happy to move it over there.