[Gluster-users] glusterfsd crashing

Tue Mar 14 17:39:40 UTC 2017

The crashes seem to have stopped after I downgraded the one machine to
match the others.
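
For anyone who finds this thread later, here is a minimal sketch of how a
stray version could be spotted, assuming you have already collected the
version string from every node (for example from `glusterfs --version`).
The host names and the find_version_outliers helper are hypothetical and
only for illustration; nothing here is part of gluster itself.

```python
from collections import Counter


def find_version_outliers(versions):
    """Return {host: version} for hosts that differ from the most common version.

    `versions` maps a host name to the version string reported by that host.
    """
    if not versions:
        return {}
    # The version reported by the largest number of hosts is treated as the baseline.
    majority, _count = Counter(versions.values()).most_common(1)[0]
    return {host: ver for host, ver in versions.items() if ver != majority}


# Hypothetical 8-node cluster: seven nodes on 3.7.12, one on 3.7.13.
reported = {f"node{i}": "3.7.12" for i in range(1, 8)}
reported["node8"] = "3.7.13"

print(find_version_outliers(reported))
```

This only flags the mismatch; it says nothing about whether the mismatch
is what caused the crashes.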

On Fri, Mar 10, 2017 at 11:50 AM, Sergei Gerasenko <gerases at gmail.com>
wrote:

> I see why it's not saving the cores: the package isn't signed with the
> right signature. I will modify the abrt configs to change that behavior and
> wait for the next crash.
>
> On Fri, Mar 10, 2017 at 11:23 AM, Vijay Bellur <vbellur at redhat.com> wrote:
>
>>
>>
>> On Fri, Mar 10, 2017 at 11:17 AM, Sergei Gerasenko <gerases at gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm running gluster 3.7.12. It's an 8-node distributed, replicated
>>> cluster (replica 2). It had been working fine for a long time when, all
>>> of a sudden, I started seeing bricks go offline. Researching further, I
>>> found messages like this:
>>>
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: pending frames:
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: frame : type(0) op(5)
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: patchset: git://git.gluster.com/glusterfs.git
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: signal received: 6
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: time of crash:
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: 2017-03-10 05:02:12
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: configuration details:
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: argp 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: backtrace 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: dlfcn 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: libpthread 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: llistxattr 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: setfsid 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: spinlock 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: epoll.h 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: xattr.h 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: st_atim.tv_nsec 1
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: package-string: glusterfs 3.7.12
>>> Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: ---------
>>>
>>> I initially thought it was related to quota support (based on some
>>> googling), so I turned off quota and also disabled NFS support to simplify
>>> the debugging. After every crash, I restarted gluster and the bricks
>>> would stay online for several hours, only to crash again later. There
>>> are lots of messages like this preceding the crash:
>>>
>>> ...
>>> [2017-03-10 04:40:46.002225] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
>>> [2017-03-10 04:40:46.002278] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
>>> The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 04:40:46.002225] and [2017-03-10 04:40:46.005699]
>>> The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 04:40:46.002278] and [2017-03-10 04:40:46.005701]
>>> [2017-03-10 04:50:47.002170] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
>>> [2017-03-10 04:50:47.002219] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
>>> The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 04:50:47.002170] and [2017-03-10 04:50:47.005623]
>>> The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 04:50:47.002219] and [2017-03-10 04:50:47.005625]
>>> [2017-03-10 05:00:48.002246] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
>>> [2017-03-10 05:00:48.002314] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
>>> The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 05:00:48.002246] and [2017-03-10 05:00:48.005828]
>>> The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 05:00:48.002314] and [2017-03-10 05:00:48.005830]
>>>
>>> One important detail I noticed yesterday is that one of the nodes was
>>> running gluster version 3.7.13! I'm not sure what performed the upgrade. So I
>>> downgraded to 3.7.12 and restarted gluster. The crash above happened
>>> several hours later. But again, the crashes had been happening before the
>>> downgrade -- possibly because of the version mismatch on one of the nodes.
>>>
>>> Anybody have any ideas?
>>>
>>>
>>
>> Do you have the core files from the crashes? If so, can you please
>> provide a gdb backtrace from one of the core files?
>>
>> Thanks,
>> Vijay
>>
>
>