[Bugs] [Bug 1263042] New: glusterfsd crash

bugzilla at redhat.com
Tue Sep 15 02:29:36 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1263042

            Bug ID: 1263042
           Summary: glusterfsd crash
           Product: GlusterFS
           Version: 3.4.2
         Component: glusterd
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: sunkai0431 at gmail.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:

We have 8 servers in this Gluster cluster, every two forming a replica pair
(4 x 2 distributed-replicate). When glusterd on 172.16.161.5 starts, no matter
whether cluster.self-heal-daemon is on or off, the other servers hang at
df -h on mounts of this volume. But once all the gluster processes on
172.16.161.5 are killed, the whole volume becomes accessible again. Quite a
lot of zombie processes also exist on that server:

#ps aux | grep Z | wc -l
641

#ps aux | grep Z | head
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       301  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       327  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       350  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       431  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       478  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       524  0.0  0.0      0     0 ?        Z    08:45   0:00 [sh] <defunct>
root       526  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       573  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
root       663  0.0  0.0      0     0 ?        Z    09:10   0:00 [sh] <defunct>
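A generic way to confirm which daemon is leaving these zombies behind is to
group the defunct processes by parent PID (a diagnostic sketch, not output
from the affected server; <PPID> is a placeholder):

#ps -eo ppid,stat,comm | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head
#ps -p <PPID> -o pid,comm,args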


Version-Release number of selected component (if applicable):

#gluster --version
glusterfs 3.4.2 built on Nov  6 2014 14:14:26
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General
Public License.

How reproducible:


Steps to Reproduce:
1. Create a zpool with raidz mounted at /mnt/zpool, then: zfs create zpool/zfs;
   zfs set xattr=sa zpool/zfs
2. Stop cluster.self-heal-daemon on a normal node
3. grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g',
   then setfattr -n trusted.glusterfs.volume-id -v
   0x3587ec7fa7574b8b8f02244c5eddf16c /mnt/zpool/zfs
4. Start /etc/init.d/glusterd
5. Start cluster.self-heal-daemon (a consolidated command sketch follows these steps)
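For reference, a consolidated sketch of the steps above as shell commands,
assuming the self-heal daemon is toggled via volume set (pool disks are
placeholders; all other values are taken from this report):

#zpool create -m /mnt/zpool zpool raidz <disk1> <disk2> <disk3>
#zfs create zpool/zfs
#zfs set xattr=sa zpool/zfs
#gluster volume set storage_1 cluster.self-heal-daemon off
#VOLID=$(grep volume-id /var/lib/glusterd/vols/storage_1/info | cut -d= -f2 | sed 's/-//g')
#setfattr -n trusted.glusterfs.volume-id -v 0x$VOLID /mnt/zpool/zfs
#/etc/init.d/glusterd start
#gluster volume set storage_1 cluster.self-heal-daemon on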

Actual results:
glusterd crashed

#gluster volume heal  storage_1 info
Connection failed. Please check if gluster daemon is operational.
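A quick way to confirm that glusterd itself is down on the node (generic
commands, not taken from the report):

#pidof glusterd || echo 'glusterd is not running'
#/etc/init.d/glusterd status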

#gluster volume status
Status of volume: storage_1
Gluster process                             Port    Online  Pid
------------------------------------------------------------------------------
Brick 172.16.161.10:/mnt/zpool/zfs          49152   Y       31628
Brick 172.16.161.3:/mnt/zpool/zfs           49152   Y       689
Brick 172.16.161.4:/mnt/zpool/zfs           49153   Y       29349
Brick 172.16.161.5:/mnt/zpool/zfs           49154   Y       17987
Brick 172.16.161.6:/mnt/zpool/zfs           49152   Y       13826
Brick 172.16.161.7:/mnt/zpool/zfs           49152   Y       28246
Brick 172.16.161.8:/mnt/zpool/zfs           49152   Y       21390
Brick 172.16.161.9:/mnt/zpool/zfs           49152   Y       24121
NFS Server on localhost                     2049    Y       24470
Self-heal Daemon on localhost               N/A     Y       24477
NFS Server on 172.16.161.4                  2049    Y       6262
Self-heal Daemon on 172.16.161.4            N/A     Y       6270
NFS Server on 172.16.161.3                  2049    Y       21079
Self-heal Daemon on 172.16.161.3            N/A     Y       21086
NFS Server on 172.16.161.8                  2049    Y       32357
Self-heal Daemon on 172.16.161.8            N/A     Y       32390
NFS Server on 172.16.161.10                 2049    Y       8899
Self-heal Daemon on 172.16.161.10           N/A     Y       8915
NFS Server on 172.16.161.7                  2049    Y       5978
Self-heal Daemon on 172.16.161.7            N/A     Y       5985
NFS Server on 172.16.161.9                  2049    Y       1727
Self-heal Daemon on 172.16.161.9            N/A     Y       1734
NFS Server on 172.16.161.5                  2049    Y       12371
Self-heal Daemon on 172.16.161.5            N/A     Y       12375

#gluster volume info

Volume Name: storage_1
Type: Distributed-Replicate
Volume ID: 3587ec7f-a757-4b8b-8f02-244c5eddf16c
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 172.16.161.10:/mnt/zpool/zfs
Brick2: 172.16.161.3:/mnt/zpool/zfs
Brick3: 172.16.161.4:/mnt/zpool/zfs
Brick4: 172.16.161.5:/mnt/zpool/zfs
Brick5: 172.16.161.6:/mnt/zpool/zfs
Brick6: 172.16.161.7:/mnt/zpool/zfs
Brick7: 172.16.161.8:/mnt/zpool/zfs
Brick8: 172.16.161.9:/mnt/zpool/zfs
Options Reconfigured:
cluster.self-heal-daemon: on
performance.flush-behind: off
cluster.min-free-disk: 50GB
nfs.port: 2049
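Note that the Volume ID above is the same value written to the brick xattr in
step 3 of the reproduction; it can be verified directly on a brick with
standard attr tooling (generic usage, not from the report):

#getfattr -n trusted.glusterfs.volume-id -e hex /mnt/zpool/zfs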

#glustershd.log
[2015-09-14 14:25:16.727491] I
[client-handshake.c:1659:select_server_supported_programs]
0-storage_1-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727550] I
[client-handshake.c:1659:select_server_supported_programs]
0-storage_1-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727602] I
[client-handshake.c:1659:select_server_supported_programs]
0-storage_1-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727669] I
[client-handshake.c:1659:select_server_supported_programs]
0-storage_1-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727729] I
[client-handshake.c:1659:select_server_supported_programs]
0-storage_1-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-09-14 14:25:16.727798] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-0: Connected to 172.16.161.10:49152, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.727814] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-0: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.727880] I [afr-common.c:3698:afr_notify]
0-storage_1-replicate-0: Subvolume 'storage_1-client-0' came back up; going
online.
[2015-09-14 14:25:16.728293] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-7: Connected to 172.16.161.9:49152, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728313] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-7: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.728363] I [afr-common.c:3698:afr_notify]
0-storage_1-replicate-3: Subvolume 'storage_1-client-7' came back up; going
online.
[2015-09-14 14:25:16.728432] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-4: Connected to 172.16.161.6:49152, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728449] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-4: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.728494] I [afr-common.c:3698:afr_notify]
0-storage_1-replicate-2: Subvolume 'storage_1-client-4' came back up; going
online.
[2015-09-14 14:25:16.728561] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-5: Connected to 172.16.161.7:49152, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728590] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-5: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.728706] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-6: Connected to 172.16.161.8:49152, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728732] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-6: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.728828] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-0: Server
lk version = 1
[2015-09-14 14:25:16.728862] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-3: Connected to 172.16.161.5:49154, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.728879] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-3: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.728931] I [afr-common.c:3698:afr_notify]
0-storage_1-replicate-1: Subvolume 'storage_1-client-3' came back up; going
online.
[2015-09-14 14:25:16.728990] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-2: Connected to 172.16.161.4:49153, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729005] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-2: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.729092] I [client-handshake.c:1456:client_setvolume_cbk]
0-storage_1-client-1: Connected to 172.16.161.3:49152, attached to remote
volume '/mnt/zpool/zfs'.
[2015-09-14 14:25:16.729108] I [client-handshake.c:1468:client_setvolume_cbk]
0-storage_1-client-1: Server and Client lk-version numbers are not same,
reopening the fds
[2015-09-14 14:25:16.729191] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-7: Server
lk version = 1
[2015-09-14 14:25:16.729216] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-4: Server
lk version = 1
[2015-09-14 14:25:16.729235] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-5: Server
lk version = 1
[2015-09-14 14:25:16.729254] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-6: Server
lk version = 1
[2015-09-14 14:25:16.729281] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-3: Server
lk version = 1
[2015-09-14 14:25:16.729362] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-2: Server
lk version = 1
[2015-09-14 14:25:16.729390] I
[client-handshake.c:450:client_set_lk_version_cbk] 0-storage_1-client-1: Server
lk version = 1
[2015-09-14 14:25:16.900936] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl]
0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:25:17.095066] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl]
0-storage_1-replicate-1: Another crawl is in progress for storage_1-client-3
[2015-09-14 14:27:35.767127] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:27:35.767175] W [socket.c:1962:__socket_proto_state_machine]
0-glusterfs: reading from socket failed. Error (No data available), peer
(127.0.0.1:24007)
[2015-09-14 14:27:45.815724] E [socket.c:2157:socket_connect_finish]
0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2015-09-14 14:27:45.815785] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:27:48.831108] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:27:51.835174] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:27:54.845877] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:27:57.854196] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:00.869561] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:03.877629] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:06.893191] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:09.899443] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:12.911237] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:15.916225] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:18.928260] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:21.934446] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:24.948352] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:27.954530] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:30.969954] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:33.976168] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:36.986261] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:39.992428] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:43.003749] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:46.009975] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:49.019259] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:52.025489] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:55.037455] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:28:58.045264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:01.055652] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:04.068772] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:07.081162] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:10.085722] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:13.096022] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:16.102167] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:19.113583] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:22.119886] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:25.134581] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:28.138851] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:29.138927] C
[client-handshake.c:127:rpc_client_ping_timer_expired] 0-storage_1-client-3:
server 172.16.161.5:49154 has not responded in the last 42 seconds,
disconnecting.
[2015-09-14 14:29:29.143105] W [socket.c:514:__socket_rwv]
0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.144516] E [rpc-clnt.c:368:saved_frames_unwind]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3]
(-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de])))
0-storage_1-client-3: forced unwinding frame type(GlusterFS 3.3)
op(XATTROP(33)) called at 2015-09-14 14:25:17.573714 (xid=0x18x)
[2015-09-14 14:29:29.144544] W [client-rpc-fops.c:1755:client3_3_xattrop_cbk]
0-storage_1-client-3: remote operation failed: Success. Path: (null) (--)
[2015-09-14 14:29:29.154316] I [socket.c:3027:socket_submit_request]
0-storage_1-client-3: not connected (priv->connected = 0)
[2015-09-14 14:29:29.154343] W [rpc-clnt.c:1488:rpc_clnt_submit]
0-storage_1-client-3: failed to submit rpc-request (XID: 0x24x Program:
GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (storage_1-client-3)
[2015-09-14 14:29:29.154362] W [client-rpc-fops.c:1538:client3_3_inodelk_cbk]
0-storage_1-client-3: remote operation failed: Transport endpoint is not
connected
[2015-09-14 14:29:29.154407] E [rpc-clnt.c:368:saved_frames_unwind]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x13d) [0x3d5ca0ea5d]
(-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3d5ca0e5c3]
(-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3d5ca0e4de])))
0-storage_1-client-3: forced unwinding frame type(GlusterFS Handshake)
op(PING(3)) called at 2015-09-14 14:28:47.010089 (xid=0x23x)
[2015-09-14 14:29:29.154418] W [client-handshake.c:276:client_ping_cbk]
0-storage_1-client-3: timer must have expired
[2015-09-14 14:29:29.154433] I [client.c:2097:client_rpc_notify]
0-storage_1-client-3: disconnected
[2015-09-14 14:29:29.154478] E [socket.c:2157:socket_connect_finish]
0-storage_1-client-3: connection to 172.16.161.5:24007 failed (Connection
refused)
[2015-09-14 14:29:29.154499] W [socket.c:514:__socket_rwv]
0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:29.154572] W [client-rpc-fops.c:1640:client3_3_entrylk_cbk]
0-storage_1-client-3: remote operation failed: Transport endpoint is not
connected
[2015-09-14 14:29:29.155101] E
[afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk]
0-storage_1-replicate-1: Non Blocking entrylks failed for
<gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180>.
[2015-09-14 14:29:29.155289] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk]
0-storage_1-client-3: remote operation failed: Transport endpoint is not
connected. Path: <gfid:d529ffe7-48c7-4b6d-b9d3-a645fc18b180>
(00000000-0000-0000-0000-000000000000). Key: glusterfs.gfid2path
[2015-09-14 14:29:29.155383] W [client-rpc-fops.c:2265:client3_3_readdir_cbk]
0-storage_1-client-3: remote operation failed: Transport endpoint is not
connected remote_fd = -2
[2015-09-14 14:29:31.154245] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:34.163264] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:37.172062] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:39.176124] W [socket.c:514:__socket_rwv]
0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:40.182083] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:42.186819] W [socket.c:514:__socket_rwv]
0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:43.198223] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:45.204365] W [socket.c:514:__socket_rwv]
0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:46.210179] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
[2015-09-14 14:29:48.215011] W [socket.c:514:__socket_rwv]
0-storage_1-client-3: readv failed (No data available)
[2015-09-14 14:29:49.225163] W [socket.c:514:__socket_rwv] 0-glusterfs: readv
failed (No data available)
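For context, the 42-second window in the critical (C) log entry above is
GlusterFS's default network.ping-timeout. If a longer window is genuinely
needed, it can be tuned per volume (generic CLI usage, not part of this
report):

#gluster volume set storage_1 network.ping-timeout 60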


Expected results:
GlusterFS should start self-heal at full speed.

Additional info:
