[Gluster-users] Client un-mounting since upgrade to 3.12.9-1 version

Mon Jun 18 04:09:17 UTC 2018

On Mon, Jun 18, 2018 at 8:11 AM, Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

> From the bt:
>
> #8  0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #9  0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
> #22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0,
> this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2,
> xdata=0x7f6eec0085a0) at readdir-ahead.c:266
> #23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>,
> cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0,
> orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at
> dht-common.c:5388
>
> It looks like an infinite recursion. Note that readdirp is wound to the
> same subvol (value of "this" is same in all calls to rda_readdirp) at the
> same offset (of value 2). This may be a bug in DHT (winding down readdirp
> with wrong offset) or in readdir-ahead (populating incorrect offset values
> in dentries it returns as readdirp response).
>

It looks to be a corruption. Value of size argument in rda_readdirp is too
big (around 127 TB) to be sane. If you've a reproducer, please run it in
valgrind or ASAN.

To make it explicit, ATM its not clear that there is bug in readdir-ahead
or DHT as it looks to be a memory corruption. Till I get a reproducer or
valgrind/ASAN output of client process when the issue occcurs, I won't be
working on this problem.

>
> On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.alig at gmail.com>
> wrote:
>
>> Hi Milind
>>
>> Thanks a lot, I manage to run gdb and produced traceback as well. Its here
>>
>> http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>
>>
>> I am trying to understand but still not able to make sense out of it.
>>
>> Thanks
>>
>> Kashif
>>
>> On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchangir at redhat.com>
>> wrote:
>>
>>> Kashif,
>>> FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
>>>
>>>
>>> On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.alig at gmail.com>
>>> wrote:
>>>
>>>> Hi Milind
>>>>
>>>> There is no glusterfs-debuginfo available for gluster-3.12 from
>>>> http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
>>>> Do you know from where I can get it?
>>>> Also when I run gdb, it says
>>>>
>>>> Missing separate debuginfos, use: debuginfo-install
>>>> glusterfs-fuse-3.12.9-1.el6.x86_64
>>>>
>>>> I can't find debug package for glusterfs-fuse either
>>>>
>>>> Thanks from the pit of despair ;)
>>>>
>>>> Kashif
>>>>
>>>>
>>>> On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.alig at gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Milind
>>>>>
>>>>> I will send you links for logs.
>>>>>
>>>>> I collected these core dumps at client and there is no glusterd
>>>>> process running on client.
>>>>>
>>>>> Kashif
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchangir at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Kashif,
>>>>>> Could you also send over the client/mount log file as Vijay suggested
>>>>>> ?
>>>>>> Or maybe the lines with the crash backtrace lines
>>>>>>
>>>>>> Also, you've mentioned that you straced glusterd, but when you ran
>>>>>> gdb, you ran it over /usr/sbin/glusterfs
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellur at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Milind
>>>>>>>>
>>>>>>>> The operating system is Scientific Linux 6 which is based on RHEL6.
>>>>>>>> The cpu arch is Intel x86_64.
>>>>>>>>
>>>>>>>> I will send you a separate email with link to core dump.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You could also grep for crash in the client log file and the lines
>>>>>>> following crash would have a backtrace in most cases.
>>>>>>>
>>>>>>> HTH,
>>>>>>> Vijay
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>>
>>>>>>>> Kashif
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
>>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Kashif,
>>>>>>>>> Could you share the core dump via Google Drive or something similar
>>>>>>>>>
>>>>>>>>> Also, let me know the CPU arch and OS Distribution on which you
>>>>>>>>> are running gluster.
>>>>>>>>>
>>>>>>>>> If you've installed the glusterfs-debuginfo package, you'll also
>>>>>>>>> get the source lines in the backtrace via gdb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Milind, Vijay
>>>>>>>>>>
>>>>>>>>>> Thanks, I have some more information now as I straced glusterd on
>>>>>>>>>> client
>>>>>>>>>>
>>>>>>>>>> 138544      0.000131 mprotect(0x7f2f70785000, 4096,
>>>>>>>>>> PROT_READ|PROT_WRITE) = 0 <0.000026>
>>>>>>>>>> 138544      0.000128 mprotect(0x7f2f70786000, 4096,
>>>>>>>>>> PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>> 138544      0.000126 mprotect(0x7f2f70787000, 4096,
>>>>>>>>>> PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>> 138544      0.000124 --- SIGSEGV {si_signo=SIGSEGV,
>>>>>>>>>> si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
>>>>>>>>>> 138544      0.000051 --- SIGSEGV {si_signo=SIGSEGV,
>>>>>>>>>> si_code=SI_KERNEL, si_addr=0} ---
>>>>>>>>>> 138551      0.105048 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138550      0.000041 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138547      0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138546      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138545      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138544      0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>> 138543      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>
>>>>>>>>>> As for I understand that somehow gluster is trying to access
>>>>>>>>>> memory in appropriate manner and kernel sends SIGSEGV
>>>>>>>>>>
>>>>>>>>>> I also got the core dump. I am trying gdb first time so I am not
>>>>>>>>>> sure whether I am using it correctly
>>>>>>>>>>
>>>>>>>>>> gdb /usr/sbin/glusterfs core.138536
>>>>>>>>>>
>>>>>>>>>> It just tell me that program terminated with signal 11,
>>>>>>>>>> segmentation fault .
>>>>>>>>>>
>>>>>>>>>> The problem is not limited to one client but happening to many
>>>>>>>>>> clients.
>>>>>>>>>>
>>>>>>>>>> I will really appreciate any help as whole file system has become
>>>>>>>>>> unusable
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Kashif
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
>>>>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kashif,
>>>>>>>>>>> You can change the log level by:
>>>>>>>>>>> $ gluster volume set <vol> diagnostics.brick-log-level TRACE
>>>>>>>>>>> $ gluster volume set <vol> diagnostics.client-log-level TRACE
>>>>>>>>>>>
>>>>>>>>>>> and see how things fare
>>>>>>>>>>>
>>>>>>>>>>> If you want fewer logs you can change the log-level to DEBUG
>>>>>>>>>>> instead of TRACE.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> Now it is unmounting every 30 mins !
>>>>>>>>>>>>
>>>>>>>>>>>> The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
>>>>>>>>>>>> have this line only
>>>>>>>>>>>>
>>>>>>>>>>>> 2018-06-12 09:53:19.303102] I [MSGID: 115013]
>>>>>>>>>>>> [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd
>>>>>>>>>>>> cleanup on /atlas/atlasdata/zgubic/hmumu/
>>>>>>>>>>>> histograms/v14.3/Signal
>>>>>>>>>>>> [2018-06-12 09:53:19.306190] I [MSGID: 101055]
>>>>>>>>>>>> [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting
>>>>>>>>>>>> down connection <server-name> -2224879-2018/06/12-09:51:01:4
>>>>>>>>>>>> 60889-atlasglust-client-0-0-0
>>>>>>>>>>>>
>>>>>>>>>>>> There is no other information. Is there any way to increase log
>>>>>>>>>>>> verbosity?
>>>>>>>>>>>>
>>>>>>>>>>>> on the client
>>>>>>>>>>>>
>>>>>>>>>>>> 2018-06-12 09:51:01.744980] I [MSGID: 114057]
>>>>>>>>>>>> [client-handshake.c:1478:select_server_supported_programs]
>>>>>>>>>>>> 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version
>>>>>>>>>>>> (330)
>>>>>>>>>>>> [2018-06-12 09:51:01.746508] I [MSGID: 114046]
>>>>>>>>>>>> [client-handshake.c:1231:client_setvolume_cbk]
>>>>>>>>>>>> 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote
>>>>>>>>>>>> volume '/glusteratlas/brick006/gv0'.
>>>>>>>>>>>> [2018-06-12 09:51:01.746543] I [MSGID: 114047]
>>>>>>>>>>>> [client-handshake.c:1242:client_setvolume_cbk]
>>>>>>>>>>>> 0-atlasglust-client-5: Server and Client lk-version numbers are not same,
>>>>>>>>>>>> reopening the fds
>>>>>>>>>>>> [2018-06-12 09:51:01.746814] I [MSGID: 114035]
>>>>>>>>>>>> [client-handshake.c:202:client_set_lk_version_cbk]
>>>>>>>>>>>> 0-atlasglust-client-5: Server lk version = 1
>>>>>>>>>>>> [2018-06-12 09:51:01.748449] I [MSGID: 114057]
>>>>>>>>>>>> [client-handshake.c:1478:select_server_supported_programs]
>>>>>>>>>>>> 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version
>>>>>>>>>>>> (330)
>>>>>>>>>>>> [2018-06-12 09:51:01.750219] I [MSGID: 114046]
>>>>>>>>>>>> [client-handshake.c:1231:client_setvolume_cbk]
>>>>>>>>>>>> 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote
>>>>>>>>>>>> volume '/glusteratlas/brick007/gv0'.
>>>>>>>>>>>> [2018-06-12 09:51:01.750261] I [MSGID: 114047]
>>>>>>>>>>>> [client-handshake.c:1242:client_setvolume_cbk]
>>>>>>>>>>>> 0-atlasglust-client-6: Server and Client lk-version numbers are not same,
>>>>>>>>>>>> reopening the fds
>>>>>>>>>>>> [2018-06-12 09:51:01.750503] I [MSGID: 114035]
>>>>>>>>>>>> [client-handshake.c:202:client_set_lk_version_cbk]
>>>>>>>>>>>> 0-atlasglust-client-6: Server lk version = 1
>>>>>>>>>>>> [2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init]
>>>>>>>>>>>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
>>>>>>>>>>>> 7.14
>>>>>>>>>>>> [2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync]
>>>>>>>>>>>> 0-fuse: switched to graph 0
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> is there a problem with server and client 1k version?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>
>>>>>>>>>>>> Kashif
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
>>>>>>>>>>>> vbellur at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
>>>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since I have updated our gluster server and client to latest
>>>>>>>>>>>>>> version 3.12.9-1, I am having this issue of gluster getting unmounted from
>>>>>>>>>>>>>> client very regularly. It was not a problem before update.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Its a distributed file system with no replication. We have
>>>>>>>>>>>>>> seven servers totaling around 480TB data. Its 97% full.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using following config on server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation on
>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation-timeout
>>>>>>>>>>>>>> 600
>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-invalidation
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.md-cache-timeout 600
>>>>>>>>>>>>>> gluster volume set atlasglust performance.parallel-readdir on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-size 1GB
>>>>>>>>>>>>>> gluster volume set atlasglust performance.client-io-threads on
>>>>>>>>>>>>>> gluster volume set atlasglust cluster.lookup-optimize on
>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>> gluster volume set atlasglust client.event-threads 4
>>>>>>>>>>>>>> gluster volume set atlasglust server.event-threads 4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> clients are mounted with this option
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> defaults,direct-io-mode=disable,attribute-timeout=600,entry-
>>>>>>>>>>>>>> timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can't see anything in the log file. Can someone suggest
>>>>>>>>>>>>>> that how to troubleshoot this issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please share the log file? Checking for messages
>>>>>>>>>>>>> related to disconnections/crashes in the log file would be a good way to
>>>>>>>>>>>>> start troubleshooting the problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Milind
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Milind
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Milind
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Milind
>>>
>>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20180618/1e8c1b17/attachment.html>