[Gluster-users] Client un-mounting since upgrade to 3.12.9-1 version

mohammad kashif kashif.alig at gmail.com
Thu Jun 14 11:12:47 UTC 2018


Hi Nithya

It seems the problem can be solved either by turning parallel-readdir off
or by downgrading the client to 3.10.12-1. Yesterday I downgraded some clients
to 3.10.12-1 and that seems to have fixed the problem. Today, when I saw your
email, I turned parallel-readdir off and the current 3.12.9-1 client started
to work. I upgraded the servers and clients to 3.12.9-1 last month and since
then clients had been unmounting intermittently, about once a week. But during
the last three days it started unmounting every few minutes. I don't know what
triggered this sudden panic, except that the file system was quite full, at
around 98%. It is a 480 TB file system with almost 80 million files.
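
For reference, turning it off is just the usual volume-set command, run from
one of the gluster servers:

  gluster volume set atlasglust performance.parallel-readdir off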

The servers have 64 GB RAM and the clients have 64 GB to 192 GB RAM. I tested
with a 192 GB RAM client and it still had the same issue.


Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
Options Reconfigured:
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on


Thanks

Kashif

On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <nbalacha at redhat.com>
wrote:

> +Poornima who works on parallel-readdir.
>
> @Poornima, Have you seen anything like this before?
>
> On 14 June 2018 at 10:07, Nithya Balachandran <nbalacha at redhat.com> wrote:
>
>> This is not the same issue as the one you are referring to - that was in the
>> RPC layer and caused the bricks to crash. This one is different, as it seems
>> to be in the dht and rda layers. It does look like a stack overflow, though.
>>
>> @Mohammad,
>>
>> Please send the following information:
>>
>> 1. gluster volume info
>> 2. The number of entries in the directory being listed
>> 3. System memory
>>
>> Does this still happen if you turn off parallel-readdir?
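>>
>> Something along these lines should do for collecting that (the directory
>> path below is only a placeholder for whichever directory you are listing
>> when the crash happens):
>>
>> # on any of the servers
>> gluster volume info atlasglust
>> # on the client: number of entries in the problem directory
>> ls -f /path/to/problem/dir | wc -l
>> # on the client: system memory
>> free -m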
>>
>> Regards,
>> Nithya
>>
>>
>>
>>
>> On 13 June 2018 at 16:40, Milind Changire <mchangir at redhat.com> wrote:
>>
>>> +Nithya
>>>
>>> Nithya,
>>> Do these logs [1] look similar to the recursive readdir() issue that
>>> you encountered just a while back?
>>> i.e. the recursive readdir() response definition in the XDR
>>>
>>> [1] http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>
>>>
>>> On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.alig at gmail.com>
>>> wrote:
>>>
>>>> Hi Milind
>>>>
>>>> Thanks a lot, I managed to run gdb and produced a backtrace as well. It's
>>>> here:
>>>>
>>>> http://www-pnp.physics.ox.ac.uk/~mohammad/backtrace.log
>>>>
>>>>
>>>> I am trying to understand it, but I am still not able to make sense of it.
>>>>
>>>> Thanks
>>>>
>>>> Kashif
>>>>
>>>> On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchangir at redhat.com>
>>>> wrote:
>>>>
>>>>> Kashif,
>>>>> FYI: http://debuginfo.centos.org/centos/6/storage/x86_64/
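>>>>>
>>>>> A yum repo stanza roughly like the one below should make those packages
>>>>> installable (the file name and repo id here are just examples; only the
>>>>> baseurl matters):
>>>>>
>>>>> # /etc/yum.repos.d/centos-storage-debuginfo.repo  (example name)
>>>>> [centos-storage-sig-debuginfo]
>>>>> name=CentOS-6 Storage SIG - debuginfo
>>>>> baseurl=http://debuginfo.centos.org/centos/6/storage/x86_64/
>>>>> enabled=1
>>>>> gpgcheck=0
>>>>>
>>>>> After that, "yum install glusterfs-debuginfo" (or the debuginfo-install
>>>>> command that gdb suggests) should pull in the symbols.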
>>>>>
>>>>>
>>>>> On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <
>>>>> kashif.alig at gmail.com> wrote:
>>>>>
>>>>>> Hi Milind
>>>>>>
>>>>>> There is no glusterfs-debuginfo available for gluster-3.12 in the
>>>>>> http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo.
>>>>>> Do you know where I can get it?
>>>>>> Also, when I run gdb, it says:
>>>>>>
>>>>>> Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64
>>>>>>
>>>>>> I can't find a debuginfo package for glusterfs-fuse either.
>>>>>>
>>>>>> Thanks from the pit of despair ;)
>>>>>>
>>>>>> Kashif
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <
>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Milind
>>>>>>>
>>>>>>> I will send you links for logs.
>>>>>>>
>>>>>>> I collected these core dumps at client and there is no glusterd
>>>>>>> process running on client.
>>>>>>>
>>>>>>> Kashif
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <
>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>
>>>>>>>> Kashif,
>>>>>>>> Could you also send over the client/mount log file, as Vijay
>>>>>>>> suggested? Or maybe just the lines with the crash backtrace.
>>>>>>>>
>>>>>>>> Also, you've mentioned that you straced glusterd, but when you ran
>>>>>>>> gdb, you ran it over /usr/sbin/glusterfs.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellur at redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <
>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Milind
>>>>>>>>>>
>>>>>>>>>> The operating system is Scientific Linux 6, which is based on
>>>>>>>>>> RHEL 6. The CPU arch is Intel x86_64.
>>>>>>>>>>
>>>>>>>>>> I will send you a separate email with link to core dump.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You could also grep for "crash" in the client log file; in most cases
>>>>>>>>> the lines following it will contain a backtrace.
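>>>>>>>>>
>>>>>>>>> For example, something like this (the client log file is named after
>>>>>>>>> the mount point, with "/" replaced by "-"):
>>>>>>>>>
>>>>>>>>> grep -A 20 -E "crash|signal received" /var/log/glusterfs/<mount-point>.log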
>>>>>>>>>
>>>>>>>>> HTH,
>>>>>>>>> Vijay
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for your help.
>>>>>>>>>>
>>>>>>>>>> Kashif
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <
>>>>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Kashif,
>>>>>>>>>>> Could you share the core dump via Google Drive or something
>>>>>>>>>>> similar?
>>>>>>>>>>>
>>>>>>>>>>> Also, let me know the CPU arch and OS Distribution on which you
>>>>>>>>>>> are running gluster.
>>>>>>>>>>>
>>>>>>>>>>> If you've installed the glusterfs-debuginfo package, you'll also
>>>>>>>>>>> get the source lines in the backtrace via gdb.
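>>>>>>>>>>>
>>>>>>>>>>> Roughly (the core file path and name will differ on your system):
>>>>>>>>>>>
>>>>>>>>>>> gdb /usr/sbin/glusterfs /path/to/core.<pid>
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> (gdb) thread apply all bt full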
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <
>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Milind, Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I have some more information now, as I straced glusterd on
>>>>>>>>>>>> the client:
>>>>>>>>>>>>
>>>>>>>>>>>> 138544      0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
>>>>>>>>>>>> 138544      0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>>> 138544      0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
>>>>>>>>>>>> 138544      0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
>>>>>>>>>>>> 138544      0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
>>>>>>>>>>>> 138551      0.105048 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138550      0.000041 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138547      0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138546      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138545      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138544      0.000008 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>> 138543      0.000007 +++ killed by SIGSEGV (core dumped) +++
>>>>>>>>>>>>
>>>>>>>>>>>> As far as I understand, gluster is somehow trying to access memory
>>>>>>>>>>>> in an inappropriate manner and the kernel sends SIGSEGV.
>>>>>>>>>>>>
>>>>>>>>>>>> I also got the core dump. I am trying gdb for the first time, so I
>>>>>>>>>>>> am not sure whether I am using it correctly:
>>>>>>>>>>>>
>>>>>>>>>>>> gdb /usr/sbin/glusterfs core.138536
>>>>>>>>>>>>
>>>>>>>>>>>> It just tells me that the program terminated with signal 11,
>>>>>>>>>>>> segmentation fault.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is not limited to one client; it is happening on many
>>>>>>>>>>>> clients.
>>>>>>>>>>>>
>>>>>>>>>>>> I would really appreciate any help, as the whole file system has
>>>>>>>>>>>> become unusable.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Kashif
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <
>>>>>>>>>>>> mchangir at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Kashif,
>>>>>>>>>>>>> You can change the log level by:
>>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.brick-log-level TRACE
>>>>>>>>>>>>> $ gluster volume set <vol> diagnostics.client-log-level TRACE
>>>>>>>>>>>>>
>>>>>>>>>>>>> and see how things fare
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want fewer logs you can change the log-level to DEBUG
>>>>>>>>>>>>> instead of TRACE.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <
>>>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vijay
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now it is unmounting every 30 minutes!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log
>>>>>>>>>>>>>> has only these lines:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
>>>>>>>>>>>>>> [2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is no other information. Is there any way to increase
>>>>>>>>>>>>>> log verbosity?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the client:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>>> [2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
>>>>>>>>>>>>>> [2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>>> [2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
>>>>>>>>>>>>>> [2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
>>>>>>>>>>>>>> [2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
>>>>>>>>>>>>>> [2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
>>>>>>>>>>>>>> [2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
>>>>>>>>>>>>>> [2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
>>>>>>>>>>>>>> [2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a problem with the server and client lk-version numbers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kashif
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <
>>>>>>>>>>>>>> vbellur at redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <
>>>>>>>>>>>>>>> kashif.alig at gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since I updated our gluster servers and clients to the latest version,
>>>>>>>>>>>>>>>> 3.12.9-1, I have been having this issue of gluster getting unmounted from
>>>>>>>>>>>>>>>> clients very regularly. It was not a problem before the update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is a distributed volume with no replication. We have seven servers
>>>>>>>>>>>>>>>> totaling around 480 TB of data. It is 97% full.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am using the following config on the servers:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation on
>>>>>>>>>>>>>>>> gluster volume set atlasglust features.cache-invalidation-timeout 600
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-invalidation on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.md-cache-timeout 600
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.parallel-readdir on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.cache-size 1GB
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.client-io-threads on
>>>>>>>>>>>>>>>> gluster volume set atlasglust cluster.lookup-optimize on
>>>>>>>>>>>>>>>> gluster volume set atlasglust performance.stat-prefetch on
>>>>>>>>>>>>>>>> gluster volume set atlasglust client.event-threads 4
>>>>>>>>>>>>>>>> gluster volume set atlasglust server.event-threads 4
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Clients are mounted with these options:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev
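>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> i.e. the fstab entries look roughly like this (the mount point here is only
>>>>>>>>>>>>>>>> an example):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pplxgluster01.X.Y.Z:/atlasglust /data/atlas glusterfs defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev 0 0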
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can't see anything in the log files. Can someone suggest how to
>>>>>>>>>>>>>>>> troubleshoot this issue?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please share the log file? Checking for messages
>>>>>>>>>>>>>>> related to disconnections/crashes in the log file would be a good way to
>>>>>>>>>>>>>>> start troubleshooting the problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Milind
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Milind
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Milind
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Milind
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Milind
>>>
>>>
>>
>