[Gluster-users] java application crashes while reading a zip file

Dmitry Isakbayev isakdim at gmail.com
Mon Jan 28 15:39:55 UTC 2019


Amar,

Thank you for helping me troubleshoot the issues.  I don't have the
resources to test the software at this point, but I will keep it in mind.

Regards,
Dmitry


On Tue, Jan 22, 2019 at 1:02 AM Amar Tumballi Suryanarayan <
atumball at redhat.com> wrote:

> Dmitry,
>
> Thanks for the detailed updates on this thread. Let us know how your
> 'production' setup is running. For a much smoother next upgrade, we request
> your help with some early testing of the glusterfs-6 RC builds, which are
> expected to be out in the first week of February.
>
> Also, if it is possible for you to automate the tests, it would be great
> to have them in our regression suite, so we can always be sure your setup
> won't break in future releases.
>
> Regards,
> Amar
>
> On Mon, Jan 7, 2019 at 11:42 PM Dmitry Isakbayev <isakdim at gmail.com>
> wrote:
>
>> This system is going into production.  I will try to replicate this
>> problem on the next installation.
>>
>> On Wed, Jan 2, 2019 at 9:25 PM Raghavendra Gowdappa <rgowdapp at redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jan 2, 2019 at 9:59 PM Dmitry Isakbayev <isakdim at gmail.com>
>>> wrote:
>>>
>>>> Still no JVM crashes.  Is it possible that running glusterfs with
>>>> performance options turned off for a couple of days cleared out the "stale
>>>> metadata issue"?
>>>>
>>>
>>> Restarting with these options toggled would've cleared the existing cache,
>>> and hence the previous stale metadata. Hitting stale metadata again
>>> depends on races, which might be why you are still not seeing the issue.
>>> Can you try with all perf xlators enabled (the default configuration)?
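>>>
>>> For example (assuming the volume name gv0 from your earlier output),
>>> something like:
>>>
>>>   gluster volume set gv0 performance.quick-read on
>>>   # or, to reset every reconfigured option back to its default:
>>>   gluster volume reset gv0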
>>>
>>>
>>>>
>>>> On Mon, Dec 31, 2018 at 1:38 PM Dmitry Isakbayev <isakdim at gmail.com>
>>>> wrote:
>>>>
>>>>> The software ran with all of the options turned off over the weekend
>>>>> without any problems.
>>>>> I will try to collect the debug info for you.  I have re-enabled the
>>>>> three options, but have yet to see the problem reoccur.
>>>>>
>>>>>
>>>>> On Sat, Dec 29, 2018 at 6:46 PM Raghavendra Gowdappa <
>>>>> rgowdapp at redhat.com> wrote:
>>>>>
>>>>>> Thanks Dmitry. Can you provide the following debug info I asked
>>>>>> earlier:
>>>>>>
>>>>>> * strace -ff -v ... of java application
>>>>>> * dump of the I/O traffic seen by the mountpoint (use --dump-fuse
>>>>>> while mounting).
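>>>>>>
>>>>>> For example, something along these lines (the java invocation and the
>>>>>> paths here are only placeholders for your setup):
>>>>>>
>>>>>>   # one strace output file per process, named /tmp/app.strace.<pid>
>>>>>>   strace -ff -v -o /tmp/app.strace java -jar yourapp.jar
>>>>>>
>>>>>>   # mount the volume with the fuse traffic dumped to a file
>>>>>>   glusterfs --volfile-server=<server> --volfile-id=gv0 \
>>>>>>       --dump-fuse=/tmp/fuse.dump /mnt/glusterfs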
>>>>>>
>>>>>> regards,
>>>>>> Raghavendra
>>>>>>
>>>>>> On Sat, Dec 29, 2018 at 2:08 AM Dmitry Isakbayev <isakdim at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> These 3 options seem to trigger both problems (reading the zip file
>>>>>>> and renaming files).
>>>>>>>
>>>>>>> Options Reconfigured:
>>>>>>> performance.io-cache: off
>>>>>>> performance.stat-prefetch: off
>>>>>>> performance.quick-read: off
>>>>>>> performance.parallel-readdir: off
>>>>>>> *performance.readdir-ahead: on*
>>>>>>> *performance.write-behind: on*
>>>>>>> *performance.read-ahead: on*
>>>>>>> performance.client-io-threads: off
>>>>>>> nfs.disable: on
>>>>>>> transport.address-family: inet
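>>>>>>>
>>>>>>> (For reference, I am flipping each option with the usual set command,
>>>>>>> e.g. `gluster volume set gv0 performance.readdir-ahead on`, one at a
>>>>>>> time.)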
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 28, 2018 at 10:24 AM Dmitry Isakbayev <isakdim at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Turning on a single option at a time still worked fine.  I will
>>>>>>>> keep trying.
>>>>>>>>
>>>>>>>> We had used 4.1.5 on KVM/CentOS7.5 at AWS without these issues or
>>>>>>>> log messages.  Do you suppose these issues are triggered by the new
>>>>>>>> environment, or did they not exist in 4.1.5?
>>>>>>>>
>>>>>>>> [root at node1 ~]# glusterfs --version
>>>>>>>> glusterfs 4.1.5
>>>>>>>>
>>>>>>>> On AWS using
>>>>>>>> [root at node1 ~]# hostnamectl
>>>>>>>>    Static hostname: node1
>>>>>>>>          Icon name: computer-vm
>>>>>>>>            Chassis: vm
>>>>>>>>         Machine ID: b30d0f2110ac3807b210c19ede3ce88f
>>>>>>>>            Boot ID: 52bb159a0aa94043a40e7c7651967bd9
>>>>>>>>     Virtualization: kvm
>>>>>>>>   Operating System: CentOS Linux 7 (Core)
>>>>>>>>        CPE OS Name: cpe:/o:centos:centos:7
>>>>>>>>             Kernel: Linux 3.10.0-862.3.2.el7.x86_64
>>>>>>>>       Architecture: x86-64
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Dec 28, 2018 at 8:56 AM Raghavendra Gowdappa <
>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 28, 2018 at 7:23 PM Dmitry Isakbayev <
>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok. I will try different options.
>>>>>>>>>>
>>>>>>>>>> This system is scheduled to go into production soon.  What
>>>>>>>>>> version would you recommend to roll back to?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These are long-standing issues, so rolling back may not make
>>>>>>>>> them go away. Instead, if the performance is acceptable to you,
>>>>>>>>> please keep these xlators off in production.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Thu, Dec 27, 2018 at 10:55 PM Raghavendra Gowdappa <
>>>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Dec 28, 2018 at 3:13 AM Dmitry Isakbayev <
>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Raghavendra,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the suggestion.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am using
>>>>>>>>>>>>
>>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster --version
>>>>>>>>>>>> glusterfs 5.0
>>>>>>>>>>>>
>>>>>>>>>>>> On
>>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# hostnamectl
>>>>>>>>>>>>          Icon name: computer-vm
>>>>>>>>>>>>            Chassis: vm
>>>>>>>>>>>>         Machine ID: e44b8478ef7a467d98363614f4e50535
>>>>>>>>>>>>            Boot ID: eed98992fdda4c88bdd459a89101766b
>>>>>>>>>>>>     Virtualization: vmware
>>>>>>>>>>>>   Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
>>>>>>>>>>>>        CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
>>>>>>>>>>>>             Kernel: Linux 3.10.0-862.14.4.el7.x86_64
>>>>>>>>>>>>       Architecture: x86-64
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I have configured the following options
>>>>>>>>>>>>
>>>>>>>>>>>> [root at jl-fanexoss1p glusterfs]# gluster volume info
>>>>>>>>>>>> Volume Name: gv0
>>>>>>>>>>>> Type: Replicate
>>>>>>>>>>>> Volume ID: 5ffbda09-c5e2-4abc-b89e-79b5d8a40824
>>>>>>>>>>>> Status: Started
>>>>>>>>>>>> Snapshot Count: 0
>>>>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>>> Bricks:
>>>>>>>>>>>> Brick1: jl-fanexoss1p.cspire.net:/data/brick1/gv0
>>>>>>>>>>>> Brick2: sl-fanexoss2p.cspire.net:/data/brick1/gv0
>>>>>>>>>>>> Brick3: nxquorum1p.cspire.net:/data/brick1/gv0
>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>> performance.io-cache: off
>>>>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>>>>> performance.quick-read: off
>>>>>>>>>>>> performance.parallel-readdir: off
>>>>>>>>>>>> performance.readdir-ahead: off
>>>>>>>>>>>> performance.write-behind: off
>>>>>>>>>>>> performance.read-ahead: off
>>>>>>>>>>>> performance.client-io-threads: off
>>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>>> transport.address-family: inet
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know if it is related, but I am seeing a lot of these:
>>>>>>>>>>>> [2018-12-27 20:19:23.776080] W [MSGID: 114031]
>>>>>>>>>>>> [client-rpc-fops_v2.c:1932:client4_0_seek_cbk] 2-gv0-client-0: remote
>>>>>>>>>>>> operation failed [No such device or address]
>>>>>>>>>>>> [2018-12-27 20:19:47.735190] E [MSGID: 101191]
>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
>>>>>>>>>>>> handler
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> These msgs were introduced by patch [1]. To the best of my
>>>>>>>>>>> knowledge they are benign. We'll be sending a patch to fix these msgs
>>>>>>>>>>> though.
>>>>>>>>>>>
>>>>>>>>>>> +Mohit Agrawal <moagrawa at redhat.com> +Milind Changire
>>>>>>>>>>> <mchangir at redhat.com>. Can you try to identify why we are
>>>>>>>>>>> seeing these messages? If possible, please send a patch to fix this.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://review.gluster.org/r/I578c3fc67713f4234bd3abbec5d3fbba19059ea5
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> And java.io exceptions trying to rename files.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When you see the errors, is it possible to collect:
>>>>>>>>>>> * strace of the java application (strace -ff -v ...)
>>>>>>>>>>> * fuse-dump of the glusterfs mount (use option --dump-fuse while
>>>>>>>>>>> mounting)?
>>>>>>>>>>>
>>>>>>>>>>> I also need another favour from you. By trial and error, can you
>>>>>>>>>>> point out which of the many performance xlators you've turned off is
>>>>>>>>>>> causing the issue?
>>>>>>>>>>>
>>>>>>>>>>> The above two data-points will help us to fix the problem.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Thank You,
>>>>>>>>>>>> Dmitry
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 27, 2018 at 3:48 PM Raghavendra Gowdappa <
>>>>>>>>>>>> rgowdapp at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What version of glusterfs are you using? It might be either
>>>>>>>>>>>>> * a stale metadata issue.
>>>>>>>>>>>>> * inconsistent ctime issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you try turning off all performance xlators? If the issue
>>>>>>>>>>>>> is 1, that should help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Dec 28, 2018 at 1:51 AM Dmitry Isakbayev <
>>>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attempted to set `performance.read-ahead off` according to
>>>>>>>>>>>>>> https://jira.apache.org/jira/browse/AMQ-7041
>>>>>>>>>>>>>> That did not help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 24, 2018 at 2:11 PM Dmitry Isakbayev <
>>>>>>>>>>>>>> isakdim at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The core file generated by JVM suggests that it happens
>>>>>>>>>>>>>>> because the file is changing while it is being read -
>>>>>>>>>>>>>>> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8186557
>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>> The application reads in the zipfile and goes through the
>>>>>>>>>>>>>>> zip entries, then reloads the file and goes through the zip
>>>>>>>>>>>>>>> entries again.  It does so 3 times.  The application never
>>>>>>>>>>>>>>> crashes on the 1st cycle but sometimes crashes on the 2nd or
>>>>>>>>>>>>>>> 3rd cycle.
>>>>>>>>>>>>>>> The zip file is generated about 20 seconds prior to it being
>>>>>>>>>>>>>>> used and is not updated or even used by any other application.  I have
>>>>>>>>>>>>>>> never seen this problem on a plain file system.
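>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A minimal sketch of the access pattern (the path is made up
>>>>>>>>>>>>>>> and the real application does more work per entry):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> import java.io.IOException;
>>>>>>>>>>>>>>> import java.util.Enumeration;
>>>>>>>>>>>>>>> import java.util.zip.ZipEntry;
>>>>>>>>>>>>>>> import java.util.zip.ZipFile;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> public class ZipReload {
>>>>>>>>>>>>>>>     public static void main(String[] args) throws IOException {
>>>>>>>>>>>>>>>         // Three passes over the same archive; the crash shows
>>>>>>>>>>>>>>>         // up on the 2nd or 3rd pass, never on the 1st.
>>>>>>>>>>>>>>>         for (int pass = 1; pass <= 3; pass++) {
>>>>>>>>>>>>>>>             try (ZipFile zip = new ZipFile("/mnt/glusterfs/archive.zip")) {
>>>>>>>>>>>>>>>                 Enumeration<? extends ZipEntry> entries = zip.entries();
>>>>>>>>>>>>>>>                 while (entries.hasMoreElements()) {
>>>>>>>>>>>>>>>                     ZipEntry e = entries.nextElement();
>>>>>>>>>>>>>>>                     System.out.println(pass + ": " + e.getName());
>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>> }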
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would appreciate any suggestions on how to go about
>>>>>>>>>>>>>>> debugging this issue.  I can change the source code of the
>>>>>>>>>>>>>>> java application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Dmitry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>
>
>
> --
> Amar Tumballi (amarts)
>