[Gluster-users] Gluster volume brick keeps going offline
Atin Mukherjee
amukherj at redhat.com
Fri Mar 20 04:57:22 UTC 2015
I see there is a crash in the brick log:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-03-19 06:00:35
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.5.0
/lib/x86_64-linux-gnu/libc.so.6(+0x321e0)[0x7f027c7031e0]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/locks.so(__get_entrylk_count+0x40)[0x7f0277fc5d70]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/locks.so(get_entrylk_count+0x4d)[0x7f0277fc5ddd]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/locks.so(pl_entrylk_xattr_fill+0x19)[0x7f0277fc2df9]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/locks.so(pl_lookup_cbk+0x1d0)[0x7f0277fc3390]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/access-control.so(posix_acl_lookup_cbk+0x12b)[0x7f02781d91fb]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/storage/posix.so(posix_lookup+0x331)[0x7f02788046c1]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_lookup+0x70)[0x7f027d6d1270]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/access-control.so(posix_acl_lookup+0x1b5)[0x7f02781d72f5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/features/locks.so(pl_lookup+0x211)[0x7f0277fbd391]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/performance/io-threads.so(iot_lookup_wrapper+0x140)[0x7f0277da82d0]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(call_resume+0x126)[0x7f027d6e5f16]
/usr/lib/x86_64-linux-gnu/glusterfs/3.5.0/xlator/performance/io-threads.so(iot_worker+0x13e)[0x7f0277da86be]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7f027ce5bb50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f027c7ad70d]
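
If a core file was left behind, a symbol-resolved backtrace would help pin
this down further. A minimal sketch, assuming core dumps are enabled and the
glusterfs debug symbols are installed (the core path below is hypothetical --
use whatever your system actually wrote):

  gdb /usr/sbin/glusterfsd /core.13461 \
      -ex 'set pagination off' \
      -ex 'thread apply all bt full' \
      -ex quit > brick-backtrace.txt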
Pranith/Ravi,
Could you help Kaamesh with this?

Also, on the glusterd side I see some RPC-related failures (probably a
corruption). What gluster version are you using? Are there any
suspicious logs on the other node?
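
To save a round trip: the exact version and a full log bundle from both
nodes are the quickest way for us to dig in. A sketch, assuming the default
log location:

  gluster --version
  sudo tar czf gluster-logs-$(hostname).tar.gz /var/log/glusterfs
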
~Atin
On 03/19/2015 12:58 PM, Kaamesh Kamalaaharan wrote:
> Sorry, forgot to include the attachment
>
> Thank You Kindly,
> Kaamesh
> Bioinformatician
> Novocraft Technologies Sdn Bhd
> C-23A-05, 3 Two Square, Section 19, 46300 Petaling Jaya
> Selangor Darul Ehsan
> Malaysia
> Mobile: +60176562635
> Ph: +60379600541
> Fax: +60379600540
>
> On Thu, Mar 19, 2015 at 2:40 PM, Kaamesh Kamalaaharan
> <kaamesh at novocraft.com> wrote:
>
>> Hi Atin, thanks for the reply. I'm not sure which logs are relevant, so
>> I'll just attach them all in a gz file.
>>
>> I ran a sudo gluster volume start gfsvolume force at 2015-03-19 05:49.
>> I hope this helps.
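>>
>> (In case it helps narrow things down, the brick's own log around that
>> window is probably the most telling piece. A sketch, where the log path
>> follows the default naming for a brick at /export/sda/brick:
>>
>>   sudo grep -B2 -A20 'signal received' \
>>       /var/log/glusterfs/bricks/export-sda-brick.log)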
>>
>> Thank You Kindly,
>> Kaamesh
>>
>> On Sun, Mar 15, 2015 at 11:41 PM, Atin Mukherjee
>> <amukherj at redhat.com> wrote:
>>
>>> Could you attach the logs for the analysis?
>>>
>>> ~Atin
>>>
>>> On 03/13/2015 03:29 PM, Kaamesh Kamalaaharan wrote:
>>>> Hi guys. I've been using gluster for a while now and, despite a few
>>>> hiccups, I find it's a great system to use. One of my more persistent
>>>> hiccups is an issue with one brick going offline.
>>>>
>>>> My setup is a 2-brick, 2-node setup. My main brick is gfs1, which has
>>>> not given me any problems. gfs2, however, keeps going offline. Following
>>>> http://www.gluster.org/pipermail/gluster-users/2014-June/017583.html
>>>> temporarily fixed the error, but the brick goes offline within the hour.
>>>>
>>>> This is what I get from my volume status command:
>>>>
>>>> sudo gluster volume status
>>>>>
>>>>> Status of volume: gfsvolume
>>>>> Gluster process                                 Port    Online  Pid
>>>>> ------------------------------------------------------------------------------
>>>>> Brick gfs1:/export/sda/brick                    49153   Y       9760
>>>>> Brick gfs2:/export/sda/brick                    N/A     N       13461
>>>>> NFS Server on localhost                         2049    Y       13473
>>>>> Self-heal Daemon on localhost                   N/A     Y       13480
>>>>> NFS Server on gfs1                              2049    Y       16166
>>>>> Self-heal Daemon on gfs1                        N/A     Y       16173
>>>>>
>>>>> Task Status of Volume gfsvolume
>>>>> ------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>>
>>>> Doing sudo gluster volume start gfsvolume force gives me this:
>>>>
>>>> sudo gluster volume status
>>>>>
>>>>> Status of volume: gfsvolume
>>>>> Gluster process                                 Port    Online  Pid
>>>>> ------------------------------------------------------------------------------
>>>>> Brick gfs1:/export/sda/brick                    49153   Y       9760
>>>>> Brick gfs2:/export/sda/brick                    49153   Y       13461
>>>>> NFS Server on localhost                         2049    Y       13473
>>>>> Self-heal Daemon on localhost                   N/A     Y       13480
>>>>> NFS Server on gfs1                              2049    Y       16166
>>>>> Self-heal Daemon on gfs1                        N/A     Y       16173
>>>>>
>>>>> Task Status of Volume gfsvolume
>>>>> ------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>
>>>> Half an hour later, my brick goes down again.
>>>>
>>>> This is my glustershd.log. I snipped it because the rest of the log
>>>> is a repeat of the same error:
>>>>
>>>>>
>>>>> [2015-03-13 02:09:41.951556] I [glusterfsd.c:1959:main]
>>>>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.5.0
>>>>> (/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd
>>>>> -p /var/lib/glusterd/glustershd/run/glustershd.pid
>>>>> -l /var/log/glusterfs/glustershd.log
>>>>> -S /var/run/deac2f873d0ac5b6c3e84b23c4790172.socket
>>>>> --xlator-option *replicate*.node-uuid=adbb7505-3342-4c6d-be3d-75938633612c)
>>>>> [2015-03-13 02:09:41.954173] I [socket.c:3561:socket_init]
>>>>> 0-socket.glusterfsd: SSL support is NOT enabled
>>>>> [2015-03-13 02:09:41.954236] I [socket.c:3576:socket_init]
>>>>> 0-socket.glusterfsd: using system polling thread
>>>>> [2015-03-13 02:09:41.954421] I [socket.c:3561:socket_init] 0-glusterfs:
>>>>> SSL support is NOT enabled
>>>>> [2015-03-13 02:09:41.954443] I [socket.c:3576:socket_init] 0-glusterfs:
>>>>> using system polling thread
>>>>> [2015-03-13 02:09:41.956731] I [graph.c:254:gf_add_cmdline_options]
>>>>> 0-gfsvolume-replicate-0: adding option 'node-uuid' for volume
>>>>> 'gfsvolume-replicate-0' with value 'adbb7505-3342-4c6d-be3d-75938633612c'
>>>>> [2015-03-13 02:09:41.960210] I [rpc-clnt.c:972:rpc_clnt_connection_init]
>>>>> 0-gfsvolume-client-1: setting frame-timeout to 90
>>>>> [2015-03-13 02:09:41.960288] I [socket.c:3561:socket_init]
>>>>> 0-gfsvolume-client-1: SSL support is NOT enabled
>>>>> [2015-03-13 02:09:41.960301] I [socket.c:3576:socket_init]
>>>>> 0-gfsvolume-client-1: using system polling thread
>>>>> [2015-03-13 02:09:41.961095] I [rpc-clnt.c:972:rpc_clnt_connection_init]
>>>>> 0-gfsvolume-client-0: setting frame-timeout to 90
>>>>> [2015-03-13 02:09:41.961134] I [socket.c:3561:socket_init]
>>>>> 0-gfsvolume-client-0: SSL support is NOT enabled
>>>>> [2015-03-13 02:09:41.961145] I [socket.c:3576:socket_init]
>>>>> 0-gfsvolume-client-0: using system polling thread
>>>>> [2015-03-13 02:09:41.961173] I [client.c:2273:notify]
>>>>> 0-gfsvolume-client-0: parent translators are ready, attempting connect
>>>>> on transport
>>>>> [2015-03-13 02:09:41.961412] I [client.c:2273:notify]
>>>>> 0-gfsvolume-client-1: parent translators are ready, attempting connect
>>>>> on transport
>>>>> Final graph:
>>>>>
>>>>> +------------------------------------------------------------------------------+
>>>>> 1: volume gfsvolume-client-0
>>>>> 2: type protocol/client
>>>>> 3: option remote-host gfs1
>>>>> 4: option remote-subvolume /export/sda/brick
>>>>> 5: option transport-type socket
>>>>> 6: option frame-timeout 90
>>>>> 7: option ping-timeout 30
>>>>> 8: end-volume
>>>>> 9:
>>>>> 10: volume gfsvolume-client-1
>>>>> 11: type protocol/client
>>>>> 12: option remote-host gfs2
>>>>> 13: option remote-subvolume /export/sda/brick
>>>>> 14: option transport-type socket
>>>>> 15: option frame-timeout 90
>>>>> 16: option ping-timeout 30
>>>>> 17: end-volume
>>>>> 18:
>>>>> 19: volume gfsvolume-replicate-0
>>>>> 20: type cluster/replicate
>>>>> 21: option node-uuid adbb7505-3342-4c6d-be3d-75938633612c
>>>>> 22: option background-self-heal-count 0
>>>>> 23: option metadata-self-heal on
>>>>> 24: option data-self-heal on
>>>>> 25: option entry-self-heal on
>>>>> 26: option self-heal-daemon on
>>>>> 27: option data-self-heal-algorithm diff
>>>>> 28: option quorum-type fixed
>>>>> 29: option quorum-count 1
>>>>> 30: option iam-self-heal-daemon yes
>>>>> 31: subvolumes gfsvolume-client-0 gfsvolume-client-1
>>>>> 32: end-volume
>>>>> 33:
>>>>> 34: volume glustershd
>>>>> 35: type debug/io-stats
>>>>> 36: subvolumes gfsvolume-replicate-0
>>>>> 37: end-volume
>>>>>
>>>>>
>>>>> +------------------------------------------------------------------------------+
>>>>> [2015-03-13 02:09:41.961871] I [rpc-clnt.c:1685:rpc_clnt_reconfig]
>>>>> 0-gfsvolume-client-1: changing port to 49153 (from 0)
>>>>> [2015-03-13 02:09:41.962129] I
>>>>> [client-handshake.c:1659:select_server_supported_programs]
>>>>> 0-gfsvolume-client-1: Using Program GlusterFS 3.3, Num (1298437),
>>>>> Version (330)
>>>>> [2015-03-13 02:09:41.962344] I
>>>>> [client-handshake.c:1456:client_setvolume_cbk] 0-gfsvolume-client-1:
>>>>> Connected to 172.20.20.22:49153, attached to remote volume
>>>>> '/export/sda/brick'.
>>>>> [2015-03-13 02:09:41.962363] I
>>>>> [client-handshake.c:1468:client_setvolume_cbk] 0-gfsvolume-client-1:
>>>>> Server and Client lk-version numbers are not same, reopening the fds
>>>>> [2015-03-13 02:09:41.962416] I [afr-common.c:3922:afr_notify]
>>>>> 0-gfsvolume-replicate-0: Subvolume 'gfsvolume-client-1' came back up;
>>>>> going online.
>>>>> [2015-03-13 02:09:41.962487] I
>>>>> [client-handshake.c:450:client_set_lk_version_cbk] 0-gfsvolume-client-1:
>>>>> Server lk version = 1
>>>>> [2015-03-13 02:09:41.963109] E
>>>>> [afr-self-heald.c:1479:afr_find_child_position] 0-gfsvolume-replicate-0:
>>>>> getxattr failed on gfsvolume-client-0 - (Transport endpoint is not
>>>>> connected)
>>>>> [2015-03-13 02:09:41.963502] I
>>>>> [afr-self-heald.c:1687:afr_dir_exclusive_crawl] 0-gfsvolume-replicate-0:
>>>>> Another crawl is in progress for gfsvolume-client-1
>>>>> [2015-03-13 02:09:41.967478] E
>>>>> [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk]
>>>>> 0-gfsvolume-replicate-0: Non Blocking entrylks failed for
>>>>> <gfid:66af7dc1-a2e6-4919-9ea1-ad75fe2d40b9>.
>>>>> [2015-03-13 02:09:41.968550] E
>>>>> [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk]
>>>>> 0-gfsvolume-replicate-0: Non Blocking entrylks failed for
>>>>> <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>.
>>>>> [2015-03-13 02:09:41.969663] E
>>>>> [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk]
>>>>> 0-gfsvolume-replicate-0: Non Blocking entrylks failed for
>>>>> <gfid:3762920e-9631-4a52-9a9f-4f04d09e8d84>.
>>>>> [2015-03-13 02:09:41.974345] E
>>>>> [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk]
>>>>> 0-gfsvolume-replicate-0: Non Blocking entrylks failed for
>>>>> <gfid:66af7dc1-a2e6-4919-9ea1-ad75fe2d40b9>.
>>>>> [2015-03-13 02:09:41.975657] E
>>>>> [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk]
>>>>> 0-gfsvolume-replicate-0: Non Blocking entrylks failed for
>>>>> <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>.
>>>>> [2015-03-13 02:09:41.977020] E
>>>>> [afr-self-heal-entry.c:2364:afr_sh_post_nonblocking_entry_cbk]
>>>>> 0-gfsvolume-replicate-0: Non Blocking entrylks failed for
>>>>> <gfid:3762920e-9631-4a52-9a9f-4f04d09e8d84>.
>>>>> [2015-03-13 02:09:44.307219] I [rpc-clnt.c:1685:rpc_clnt_reconfig]
>>>>> 0-gfsvolume-client-0: changing port to 49153 (from 0)
>>>>> [2015-03-13 02:09:44.307748] I
>>>>> [client-handshake.c:1659:select_server_supported_programs]
>>>>> 0-gfsvolume-client-0: Using Program GlusterFS 3.3, Num (1298437),
>>>>> Version (330)
>>>>> [2015-03-13 02:09:44.448377] I
>>>>> [client-handshake.c:1456:client_setvolume_cbk] 0-gfsvolume-client-0:
>>>>> Connected to 172.20.20.21:49153, attached to remote volume
>>>>> '/export/sda/brick'.
>>>>> [2015-03-13 02:09:44.448418] I
>>>>> [client-handshake.c:1468:client_setvolume_cbk] 0-gfsvolume-client-0:
>>>>> Server and Client lk-version numbers are not same, reopening the fds
>>>>> [2015-03-13 02:09:44.448713] I
>>>>> [client-handshake.c:450:client_set_lk_version_cbk] 0-gfsvolume-client-0:
>>>>> Server lk version = 1
>>>>> [2015-03-13 02:09:44.515112] I
>>>>> [afr-self-heal-common.c:2859:afr_log_self_heal_completion_status]
>>>>> 0-gfsvolume-replicate-0: foreground data self heal is successfully
>>>>> completed, data self heal from gfsvolume-client-0 to sinks
>>>>> gfsvolume-client-1, with 892928 bytes on gfsvolume-client-0, 892928
>>>>> bytes on gfsvolume-client-1, data - Pending matrix: [ [ 0 155762 ] [ 0 0 ] ]
>>>>> on <gfid:123536cc-c34b-43d7-b0c6-cf80eefa8322>
>>>>> [2015-03-13 02:09:44.809988] I
>>>>> [afr-self-heal-common.c:2859:afr_log_self_heal_completion_status]
>>>>> 0-gfsvolume-replicate-0: foreground data self heal is successfully
>>>>> completed, data self heal from gfsvolume-client-0 to sinks
>>>>> gfsvolume-client-1, with 15998976 bytes on gfsvolume-client-0, 15998976
>>>>> bytes on gfsvolume-client-1, data - Pending matrix: [ [ 0 36506 ] [ 0 0 ] ]
>>>>> on <gfid:b6dc0e74-31bf-469a-b629-ee51ab4cf729>
>>>>> [2015-03-13 02:09:44.946050] W
>>>>> [client-rpc-fops.c:574:client3_3_readlink_cbk] 0-gfsvolume-client-0:
>>>>> remote operation failed: Stale NFS file handle
>>>>> [2015-03-13 02:09:44.946097] I
>>>>> [afr-self-heal-entry.c:1538:afr_sh_entry_impunge_readlink_sink_cbk]
>>>>> 0-gfsvolume-replicate-0: readlink of
>>>>> <gfid:66af7dc1-a2e6-4919-9ea1-ad75fe2d40b9>/PB2_corrected.fastq on
>>>>> gfsvolume-client-1 failed (Stale NFS file handle)
>>>>> [2015-03-13 02:09:44.951370] I
>>>>> [afr-self-heal-entry.c:2321:afr_sh_entry_fix] 0-gfsvolume-replicate-0:
>>>>> <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>: Performing conservative merge
>>>>> [2015-03-13 02:09:45.149995] W
>>>>> [client-rpc-fops.c:574:client3_3_readlink_cbk] 0-gfsvolume-client-0:
>>>>> remote operation failed: Stale NFS file handle
>>>>> [2015-03-13 02:09:45.150036] I
>>>>> [afr-self-heal-entry.c:1538:afr_sh_entry_impunge_readlink_sink_cbk]
>>>>> 0-gfsvolume-replicate-0: readlink of
>>>>> <gfid:8a7cfa39-9a12-43cd-a9f3-9142b7403d0e>/Rscript on
>>>>> gfsvolume-client-1 failed (Stale NFS file handle)
>>>>> [2015-03-13 02:09:45.214253] W
>>>>> [client-rpc-fops.c:574:client3_3_readlink_cbk] 0-gfsvolume-client-0:
>>>>> remote operation failed: Stale NFS file handle
>>>>> [2015-03-13 02:09:45.214295] I
>>>>> [afr-self-heal-entry.c:1538:afr_sh_entry_impunge_readlink_sink_cbk]
>>>>> 0-gfsvolume-replicate-0: readlink of
>>>>> <gfid:3762920e-9631-4a52-9a9f-4f04d09e8d84>/ananas_d_tmp on
>>>>> gfsvolume-client-1 failed (Stale NFS file handle)
>>>>> [2015-03-13 02:13:27.324856] W [socket.c:522:__socket_rwv]
>>>>> 0-gfsvolume-client-1: readv on 172.20.20.22:49153 failed (No data
>>>>> available)
>>>>> [2015-03-13 02:13:27.324961] I [client.c:2208:client_rpc_notify]
>>>>> 0-gfsvolume-client-1: disconnected from 172.20.20.22:49153. Client
>>>>> process will keep trying to connect to glusterd until brick's port is
>>>>> available
>>>>> [2015-03-13 02:13:37.981531] I [rpc-clnt.c:1685:rpc_clnt_reconfig]
>>>>> 0-gfsvolume-client-1: changing port to 49153 (from 0)
>>>>> [2015-03-13 02:13:37.981781] E [socket.c:2161:socket_connect_finish]
>>>>> 0-gfsvolume-client-1: connection to 172.20.20.22:49153 failed
>>>>> (Connection refused)
>>>>> [2015-03-13 02:13:41.982125] I [rpc-clnt.c:1685:rpc_clnt_reconfig]
>>>>> 0-gfsvolume-client-1: changing port to 49153 (from 0)
>>>>> [2015-03-13 02:13:41.982353] E [socket.c:2161:socket_connect_finish]
>>>>> 0-gfsvolume-client-1: connection to 172.20.20.22:49153 failed
>>>>> (Connection refused)
>>>>> [2015-03-13 02:13:45.982693] I [rpc-clnt.c:1685:rpc_clnt_reconfig]
>>>>> 0-gfsvolume-client-1: changing port to 49153 (from 0)
>>>>> [2015-03-13 02:13:45.982926] E [socket.c:2161:socket_connect_finish]
>>>>> 0-gfsvolume-client-1: connection to 172.20.20.22:49153 failed
>>>>> (Connection refused)
>>>>> [2015-03-13 02:13:49.983309] I [rpc-clnt.c:1685:rpc_clnt_reconfig]
>>>>> 0-gfsvolume-client-1: changing port to 49153 (from 0)
>>>>>
>>>>>
>>>>
>>>> Any help would be greatly appreciated.
>>>> Thank You Kindly,
>>>> Kaamesh
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>
>>>
>>>
>>
>
--
~Atin