[Gluster-users] java application crashes while reading a zip file
Dmitry Isakbayev
isakdim at gmail.com
Mon Jan 7 18:11:12 UTC 2019
This system is going into production. I will try to replicate this problem
on the next installation.
On Wed, Jan 2, 2019 at 9:25 PM Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:
>
>
> On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev <isakdim at gmail.com> wrote:
>
>> Still no JVM crashes. Is it possible that running glusterfs with
>> performance options turned off for a couple of days cleared out the "stale
>> metadata issue"?
>>
>
> Toggling these options would've cleared the existing cache, and hence the
> previous stale metadata would've been cleared. Hitting stale metadata
> again depends on races; that might be the reason you are still not seeing
> the issue. Can you try enabling all perf xlators (the default
> configuration)?
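>
> For reference, restoring the defaults might look like the sketch below
> (the volume name gv0 is taken from the volume info later in this thread;
> option names can be checked against "gluster volume set help"):
>
>     # reset each performance xlator to its default state
>     for opt in io-cache stat-prefetch quick-read parallel-readdir \
>                readdir-ahead write-behind read-ahead client-io-threads; do
>         gluster volume reset gv0 performance.$opt
>     done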
>
>
>>
>> On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev <isakdim at gmail.com>
>> wrote:
>>
>>> The software ran with all of the options turned off over the weekend
>>> without any problems.
>>> I will try to collect the debug info for you. I have re-enabled the
>>> three options, but have yet to see the problem reoccur.
>>>
>>>
>>> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa <
>>> rgowdapp at redhat.com> wrote:
>>>
>>>> Thanks Dmitry. Can you provide the following debug info I asked for
>>>> earlier (a sketch of the commands follows the list):
>>>>
>>>> * strace -ff -v ... of java application
>>>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse while
>>>> mounting).
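>>>>
>>>> A sketch of collecting both, where server1, /mnt/gv0 and the java
>>>> command line are placeholders (only the volume name gv0 comes from this
>>>> thread):
>>>>
>>>>     # strace the application, following forks; -ff with -o writes
>>>>     # one trace file per process id
>>>>     strace -ff -v -o /tmp/app.strace java -jar /path/to/app.jar
>>>>
>>>>     # mount the volume with a dump of all FUSE traffic
>>>>     glusterfs --volfile-server=server1 --volfile-id=gv0 \
>>>>               --dump-fuse=/tmp/fuse.dump /mnt/gv0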
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev <isakdim at gmail.com>
>>>> wrote:
>>>>
>>>>> These 3 options seem to trigger both problems (reading the zip file and
>>>>> renaming files); a sketch for turning them back off follows the option
>>>>> list below.
>>>>>
>>>>> Options Reconfigured:
>>>>> performance.io-cache: off
>>>>> performance.stat-prefetch: off
>>>>> performance.quick-read: off
>>>>> performance.parallel-readdir: off
>>>>> *performance.readdir-ahead: on*
>>>>> *performance.write-behind: on*
>>>>> *performance.read-ahead: on*
>>>>> performance.client-io-threads: off
>>>>> nfs.disable: on
>>>>> transport.address-family: inet
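>>>>>
>>>>> A minimal sketch for turning those three back off, assuming the same
>>>>> volume name gv0:
>>>>>
>>>>>     # the three xlators implicated above
>>>>>     for opt in readdir-ahead write-behind read-ahead; do
>>>>>         gluster volume set gv0 performance.$opt off
>>>>>     done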
>>>>>
>>>>>
>>>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev <isakdim at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Turning a single option on at a time still worked fine. I will keep
>>>>>> trying.
>>>>>>
>>>>>> We had used 4.1.5 on KVM/CentOS 7.5 at AWS without these issues or log
>>>>>> messages. Do you suppose these issues are triggered by the new
>>>>>> environment, or did they not exist in 4.1.5?
>>>>>>
>>>>>> [root@node1 ~]# glusterfs --version
>>>>>> glusterfs 4.1.5
>>>>>>
>>>>>> On AWS using
>>>>>> [root@node1 ~]# hostnamectl
>>>>>> Static hostname: node1
>>>>>> Icon name: computer-vm
>>>>>> Chassis: vm
>>>>>> Machine ID: b30d0f2110ac3807b210c19ede3ce88f
>>>>>> Boot ID: 52bb159a0aa94043a40e7c7651967bd9
>>>>>> Virtualization: kvm
>>>>>> Operating System: CentOS Linux 7 (Core)
>>>>>> CPE OS Name: cpe:/o:centos:centos:7
>>>>>> Kernel: Linux 3.10.0-862.3.2.el7.x86_64
>>>>>> Architecture: x86-64
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa <
>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev <isakdim at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok. I will try different options.
>>>>>>>>
>>>>>>>> This system is scheduled to go into production soon. What version
>>>>>>>> would you recommend to roll back to?
>>>>>>>>
>>>>>>>
>>>>>>> These are long-standing issues, so rolling back may not make them go
>>>>>>> away. Instead, if the performance is acceptable to you, please keep
>>>>>>> these xlators off in production.
>>>>>>>
>>>>>>>
>>>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa <
>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev <
>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Raghavendra,
>>>>>>>>>>
>>>>>>>>>> Thanks for the suggestion.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am using
>>>>>>>>>>
>>>>>>>>>> [root@jl-fanexoss1p glusterfs]# gluster --version
>>>>>>>>>> glusterfs 5.0
>>>>>>>>>>
>>>>>>>>>> On
>>>>>>>>>> [root@jl-fanexoss1p glusterfs]# hostnamectl
>>>>>>>>>> Icon name: computer-vm
>>>>>>>>>> Chassis: vm
>>>>>>>>>> Machine ID: e44b8478ef7a467d98363614f4e50535
>>>>>>>>>> Boot ID: eed98992fdda4c88bdd459a89101766b
>>>>>>>>>> Virtualization: vmware
>>>>>>>>>> Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
>>>>>>>>>> CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
>>>>>>>>>> Kernel: Linux 3.10.0-862.14.4.el7.x86_64
>>>>>>>>>> Architecture: x86-64
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have configured the following options
>>>>>>>>>>
>>>>>>>>>> [root@jl-fanexoss1p glusterfs]# gluster volume info
>>>>>>>>>> Volume Name: gv0
>>>>>>>>>> Type: Replicate
>>>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824
>>>>>>>>>> Status: Started
>>>>>>>>>> Snapshot Count: 0
>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>> Transport-type: tcp
>>>>>>>>>> Bricks:
>>>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0
>>>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0
>>>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0
>>>>>>>>>> Options Reconfigured:
>>>>>>>>>> performance.io-cache: off
>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>> performance.quick-read: off
>>>>>>>>>> performance.parallel-readdir: off
>>>>>>>>>> performance.readdir-ahead: off
>>>>>>>>>> performance.write-behind: off
>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>> performance.client-io-threads: off
>>>>>>>>>> nfs.disable: on
>>>>>>>>>> transport.address-family: inet
>>>>>>>>>>
>>>>>>>>>> I don't know if it is related, but I am seeing a lot of
>>>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031]
>>>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote
>>>>>>>>>> operation failed [No such device or address]
>>>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191]
>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>>>>>>>>> handler
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These messages were introduced by patch [1]. To the best of my
>>>>>>>>> knowledge they are benign; we'll be sending a patch to fix them
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>> +Mohit Agrawal <moagrawa at redhat.com> +Milind Changire
>>>>>>>>> <mchangir at redhat.com>. Can you try to identify why we are seeing
>>>>>>>>> these messages? If possible, please send a patch to fix this.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> And I am seeing java.io exceptions when trying to rename files.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> When you see the errors, is it possible to collect:
>>>>>>>>> * an strace of the java application (strace -ff -v ...)
>>>>>>>>> * a fuse-dump of the glusterfs mount (use the option --dump-fuse while
>>>>>>>>> mounting)?
>>>>>>>>>
>>>>>>>>> I also need another favour from you. By trial and error, can you
>>>>>>>>> point out which of the many performance xlators you've turned off is
>>>>>>>>> causing the issue? (A rough sketch follows below.)
>>>>>>>>>
>>>>>>>>> The above two data-points will help us to fix the problem.
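>>>>>>>>>
>>>>>>>>> For the trial-and-error step, one rough way to structure it (gv0 and
>>>>>>>>> rerunning the java workload between steps are assumptions about your
>>>>>>>>> setup):
>>>>>>>>>
>>>>>>>>>     # starting from everything off, enable one xlator at a time
>>>>>>>>>     # and rerun the workload until the crash reappears
>>>>>>>>>     for opt in quick-read stat-prefetch io-cache read-ahead \
>>>>>>>>>                readdir-ahead write-behind parallel-readdir; do
>>>>>>>>>         gluster volume set gv0 performance.$opt on
>>>>>>>>>         echo "performance.$opt is now on - rerun the java test"
>>>>>>>>>         read -r -p "press enter to continue "
>>>>>>>>>     done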
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thank You,
>>>>>>>>>> Dmitry
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa <
>>>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> What version of glusterfs are you using? It might be either
>>>>>>>>>>> * a stale metadata issue, or
>>>>>>>>>>> * an inconsistent ctime issue.
>>>>>>>>>>>
>>>>>>>>>>> Can you try turning off all performance xlators? If it is the stale
>>>>>>>>>>> metadata issue, that should help.
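>>>>>>>>>>>
>>>>>>>>>>> To see which performance settings are currently in effect before
>>>>>>>>>>> turning them off, something like this should work (the volume name
>>>>>>>>>>> is an assumption):
>>>>>>>>>>>
>>>>>>>>>>>     # list the performance options for the volume
>>>>>>>>>>>     gluster volume get gv0 all | grep '^performance'
>>>>>>>>>>>
>>>>>>>>>>>     # then disable each one, e.g.
>>>>>>>>>>>     gluster volume set gv0 performance.quick-read off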
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev <
>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Attempted to set 'performance.read-ahead off` according to
>>>>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041
>>>>>>>>>>>> That did not help.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev <
>>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The core file generated by JVM suggests that it happens
>>>>>>>>>>>>> because the file is changing while it is being read -
>>>>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557.
>>>>>>>>>>>>> The application reads in the zip file and goes through the zip
>>>>>>>>>>>>> entries, then reloads the file and goes through the zip entries
>>>>>>>>>>>>> again. It does so 3 times. The application never crashes on the
>>>>>>>>>>>>> 1st cycle but sometimes crashes on the 2nd or 3rd cycle.
>>>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being
>>>>>>>>>>>>> used and is not updated or even used by any other application. I have
>>>>>>>>>>>>> never seen this problem on a plain file system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would appreciate any suggestions on how to go about debugging
>>>>>>>>>>>>> this issue. I can change the source code of the java application.
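>>>>>>>>>>>>>
>>>>>>>>>>>>> One shell-level sanity check of the "file changed while being
>>>>>>>>>>>>> read" theory (the path below is a placeholder): watch whether the
>>>>>>>>>>>>> zip's metadata or checksum differs between read cycles as seen
>>>>>>>>>>>>> from the glusterfs mount.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     # print mtime, size, inode and checksum a few times; any
>>>>>>>>>>>>>     # change between iterations would support the theory
>>>>>>>>>>>>>     for i in 1 2 3; do
>>>>>>>>>>>>>         stat -c '%Y %s %i' /mnt/gv0/path/to/file.zip
>>>>>>>>>>>>>         md5sum /mnt/gv0/path/to/file.zip
>>>>>>>>>>>>>         sleep 5
>>>>>>>>>>>>>     done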
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Dmitry
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>>>
>>>>>>>>>>>