[Gluster-users] Client un-mounting since upgrade to 3.12.9-1 version

Nithya Balachandran nbalacha at redhat.com
Thu Jun 14 04:39:39 UTC 2018


+Poornima who works on parallel-readdir.

@Poornima, Have you seen anything like this before?

On 14 June 2018 at 10:07, Nithya Balachandran <nbalacha at redhat.com> wrote:

> This is not the same issue as the one you are referring to - that one was in
> the RPC layer and caused the bricks to crash. This one is different, as it
> seems to be in the dht and rda layers. It does look like a stack overflow, though.
>
> @Mohammad,
>
> Please send the following information:
>
> 1. gluster volume info
> 2. The number of entries in the directory being listed
> 3. System memory
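>
> Roughly the following would cover 1-3 (just a sketch; adjust the paths to
> your setup - the directory placeholder is whichever one is being listed):
>
> $ gluster volume info
> $ ls <directory-being-listed> | wc -l
> $ free -m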
>
> Does this still happen if you turn off parallel-readdir?
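>
> For reference, disabling it would be something like this (a sketch, assuming
> the volume name atlasglust from your config below):
>
> $ gluster volume set atlasglust performance.parallel-readdir off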
>
> Regards,
> Nithya
>
>
>
>
> On 13 June 2018 at 16:40, Milind Changire <mchangir at redhat.com> wrote:
>
>> +Nithya
>>
>> Nithya,
>> Do these logs [1] look similar to the recursive readdir() issue that you
>> encountered a while back? i.e. the recursive readdir() response definition
>> in the XDR.
>>
>> [1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>
>>
>> On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.alig at gmail.com>
>> wrote:
>>
>>> Hi Milind
>>>
>>> Thanks a lot, I managed to run gdb and produced a backtrace as well. It's
>>> here:
>>>
>>> http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>
>>>
>>> I am trying to understand it but am still not able to make sense of it.
>>>
>>> Thanks
>>>
>>> Kashif
>>>
>>> On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchangir at redhat.com>
>>> wrote:
>>>
>>>> Kashif,
>>>> FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
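>>>>
>>>> If debuginfo-install can't resolve it from the configured repos, you can pull
>>>> the RPMs from that URL by hand and install them. A rough sketch (the exact
>>>> file names are an assumption; match them to your installed 3.12.9-1 packages):
>>>>
>>>> $ wget http://debuginfo.centos.org/centos/6/storage/x86_64/glusterfs-debuginfo-3.12.9-1.el6.x86_64.rpm
>>>> $ rpm -ivh glusterfs-debuginfo-3.12.9-1.el6.x86_64.rpm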
>>>>
>>>>
>>>> On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.alig at gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Milind
>>>>>
>>>>> There is no glusterfs-debuginfo package available for gluster-3.12 in the
>>>>> http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
>>>>> Do you know where I can get it from?
>>>>> Also, when I run gdb, it says:
>>>>>
>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>> glusterfs-fuse-3.12.9-1.el6.x86_64
>>>>>
>>>>> I can't find a debuginfo package for glusterfs-fuse either.
>>>>>
>>>>> Thanks from the pit of despair ;)
>>>>>
>>>>> Kashif
>>>>>
>>>>>
>>>>> On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
>>>>> kashif.alig at gmail.com> wrote:
>>>>>
>>>>>> Hi Milind
>>>>>>
>>>>>> I will send you links to the logs.
>>>>>>
>>>>>> I collected these core dumps on the client, and there is no glusterd
>>>>>> process running on the client.
>>>>>>
>>>>>> Kashif
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchangir at redhat.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Kashif,
>>>>>>> Could you also send over the client/mount log file as Vijay
>>>>>>> suggested? Or at least the lines around the crash backtrace.
>>>>>>>
>>>>>>> Also, you've mentioned that you straced glusterd, but when you ran
>>>>>>> gdb, you ran it against /usr/sbin/glusterfs.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellur at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Milind
>>>>>>>>>
>>>>>>>>> The operating system is Scientific Linux 6, which is based on
>>>>>>>>> RHEL 6. The CPU arch is Intel x86_64.
>>>>>>>>>
>>>>>>>>> I will send you a separate email with a link to the core dump.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You could also grep for "crash" in the client log file; in most cases
>>>>>>>> the lines following it will contain a backtrace.
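>>>>>>>>
>>>>>>>> Something like this should pull out the relevant section (the log file
>>>>>>>> name is a placeholder; substitute the actual client mount log under
>>>>>>>> /var/log/glusterfs/):
>>>>>>>>
>>>>>>>> $ grep -A 30 crash /var/log/glusterfs/<mount-point>.log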
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for your help.
>>>>>>>>>
>>>>>>>>> Kashif
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
>>>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> Kashif,
>>>>>>>>>> Could you share the core dump via Google Drive or something
>>>>>>>>>> similar?
>>>>>>>>>>
>>>>>>>>>> Also, let me know the CPU arch and OS Distribution on which you
>>>>>>>>>> are running gluster.
>>>>>>>>>>
>>>>>>>>>> If you've installed the glusterfs-debuginfo package, you'll also
>>>>>>>>>> get the source lines in the backtrace via gdb
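>>>>>>>>>>
>>>>>>>>>> Roughly, something like this should give a full backtrace from the
>>>>>>>>>> core file (a sketch; substitute your actual core file name):
>>>>>>>>>>
>>>>>>>>>> $ gdb /usr/sbin/glusterfs /path/to/core
>>>>>>>>>> (gdb) bt
>>>>>>>>>> (gdb) thread apply all bt full
>>>>>>>>>> (gdb) quit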
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Milind, Vijay
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I have some more information now, as I straced glusterd
>>>>>>>>>>> on the client:
>>>>>>>>>>>
>>>>>>>>>>> 138544      0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
>>>>>>>>>>> 138544      0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>> 138544      0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>> 138544      0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
>>>>>>>>>>> 138544      0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
>>>>>>>>>>> 138551      0.105048 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>> 138550      0.000041 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>> 138547      0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>> 138546      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>> 138545      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>> 138544      0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>> 138543      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>
>>>>>>>>>>> As far as I understand, gluster is somehow trying to access
>>>>>>>>>>> memory in an inappropriate manner and the kernel sends SIGSEGV.
>>>>>>>>>>>
>>>>>>>>>>> I also got the core dump. I am trying gdb for the first time, so I
>>>>>>>>>>> am not sure whether I am using it correctly:
>>>>>>>>>>>
>>>>>>>>>>> gdb /usr/sbin/glusterfs core.138536
>>>>>>>>>>>
>>>>>>>>>>> It just tells me that the program terminated with signal 11,
>>>>>>>>>>> segmentation fault.
>>>>>>>>>>>
>>>>>>>>>>> The problem is not limited to one client but is happening on many
>>>>>>>>>>> clients.
>>>>>>>>>>>
>>>>>>>>>>> I would really appreciate any help, as the whole file system has
>>>>>>>>>>> become unusable.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Kashif
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
>>>>>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Kashif,
>>>>>>>>>>>> You can change the log level by:
>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.brick-log-level TRACE
>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.client-log-level TRACE
>>>>>>>>>>>>
>>>>>>>>>>>> and see how things fare
>>>>>>>>>>>>
>>>>>>>>>>>> If you want fewer logs, you can change the log level to DEBUG
>>>>>>>>>>>> instead of TRACE.
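>>>>>>>>>>>>
>>>>>>>>>>>> Once you've captured enough, something like this should put the
>>>>>>>>>>>> levels back to their defaults (a sketch):
>>>>>>>>>>>>
>>>>>>>>>>>> $ gluster volume reset <vol> diagnostics.brick-log-level
>>>>>>>>>>>> $ gluster volume reset <vol> diagnostics.client-log-level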
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
>>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Vijay
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now it is unmounting every 30 minutes!
>>>>>>>>>>>>>
>>>>>>>>>>>>> The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
>>>>>>>>>>>>> has only these lines:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
>>>>>>>>>>>>> [2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is no other information. Is there any way to increase
>>>>>>>>>>>>> log verbosity?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the client:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>> [2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
>>>>>>>>>>>>> [2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>> [2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
>>>>>>>>>>>>> [2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>> [2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
>>>>>>>>>>>>> [2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>> [2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
>>>>>>>>>>>>> [2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
>>>>>>>>>>>>> [2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a problem with the server and client lk-version numbers?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kashif
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
>>>>>>>>>>>>> vbellur at redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
>>>>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since I updated our gluster servers and clients to the latest
>>>>>>>>>>>>>>> version, 3.12.9-1, I have been having this issue of gluster getting
>>>>>>>>>>>>>>> unmounted from clients very regularly. It was not a problem before
>>>>>>>>>>>>>>> the update.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a distributed file system with no replication. We have
>>>>>>>>>>>>>>> seven servers totaling around 480TB of data. It's 97% full.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am using the following config on the server:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation on
>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation-timeout 600
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-invalidation on
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.md-cache-timeout 600
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.parallel-readdir on
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-size 1GB
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.client-io-threads on
>>>>>>>>>>>>>>> gluster volume set atlasglust cluster.lookup-optimize on
>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>> gluster volume set atlasglust client.event-threads 4
>>>>>>>>>>>>>>> gluster volume set atlasglust server.event-threads 4
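>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For reference, the effective values can be double-checked with
>>>>>>>>>>>>>>> something like (a sketch):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> $ gluster volume get atlasglust all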
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Clients are mounted with these options:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
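>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For reference, the corresponding fstab entry looks roughly like
>>>>>>>>>>>>>>> this (the server name and mount point here are placeholders, not
>>>>>>>>>>>>>>> our real ones):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> server1:/atlasglust /mnt/atlasglust glusterfs defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev 0 0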
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can't see anything in the log file. Can someone suggest how
>>>>>>>>>>>>>>> to troubleshoot this issue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you please share the log file? Checking for messages
>>>>>>>>>>>>>> related to disconnections/crashes in the log file would be a good way to
>>>>>>>>>>>>>> start troubleshooting the problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Milind
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Milind
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Milind
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Milind
>>>>
>>>>
>>>
>>
>>
>> --
>> Milind
>>
>>
>