[Gluster-users] Volume heal info not reporting files in split brain and core dumping, after upgrading to 3.7.0

Pranith Kumar Karampuri pkarampu at redhat.com
Sat May 30 00:52:24 UTC 2015



On 05/29/2015 05:28 PM, Alessandro De Salvo wrote:
> Hi Pranith,
> many thanks. Indeed, I recompiled the glfsheal executable using the
> changes in your patch set, but without commenting out the first
> glfs_fini entry (the one marked as "Don't we need to comment this
> too?"), and it still segfaulted. After commenting out that one too,
> it seems to run fine.
> For the moment I can use this patched executable, until you fix it
> in a release.
Yes, you are correct; Ravi also pointed this out in the review. The
latest version of the patch works correctly.

http://review.gluster.org/11002 is the fix for the actual crash; it
will be available in the next release.

Pranith
> Many thanks!
>
> 	Alessandro
>
> On Fri, 2015-05-29 at 16:03 +0530, Pranith Kumar Karampuri wrote:
>>
>> On 05/29/2015 03:36 PM, Alessandro De Salvo wrote:
>>
>>> Hi Pranith,
>>> thanks to you! 2-3 days are fine, don’t worry. However, if you can
>>> give me the details of how to compile the glfsheal binary you are
>>> mentioning, we could quickly check that everything is fine with
>>> the fix before you release. So just let me know what you prefer.
>>> Waiting 2-3 days is not a problem for me though, as it is not a
>>> critical server and I could even recreate the volumes.
>> We recently introduced a code path that frees up memory in
>> long-standing processes. It seems this is not tested when the
>> file-snapshot feature is on; if that option is disabled the crash
>> won't happen. "gluster volume heal <volname> info" uses the same
>> API, but fortunately the "glfsheal" process dies as soon as the heal
>> info output is gathered, so there is no need to call this
>> memory-freeing code just before dying. For now we have enabled this
>> code path (patch: http://review.gluster.org/11001) only in internal
>> builds, not in released versions, while we stabilize that part of
>> the code. You can use this patch to patch glfsheal.
>>
>> Pranith
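
Since the crash does not occur with the file-snapshot feature
disabled, a minimal sketch of toggling it off around the heal check,
assuming the adsnet-vm-01 volume can tolerate the file-snapshot
translator being disabled briefly:

    # turn off the option under which the free-on-exit path crashes
    gluster volume set adsnet-vm-01 features.file-snapshot off
    # heal info should now complete without glfsheal dumping core
    gluster volume heal adsnet-vm-01 info
    # restore the original setting afterwards
    gluster volume set adsnet-vm-01 features.file-snapshot on
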
>>> Thanks again,
>>>
>>>
>>> Alessandro
>>>
>>>> On 29 May 2015, at 11:54, Pranith Kumar Karampuri
>>>> <pkarampu at redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 05/29/2015 03:16 PM, Alessandro De Salvo wrote:
>>>>
>>>>> Hi Pranith,
>>>>> I’m definitely sure the log is correct, but you are also correct
>>>>> when you say there is no sign of a crash (even checking with
>>>>> grep!).
>>>>> However, I see core dumps (e.g. core.19430) in /var/log/gluster,
>>>>> created every time I issue the heal info command.
>>>>> From gdb I see this:
>>>> Thanks for providing the information, Alessandro. We will fix this
>>>> issue. I am wondering how we can unblock you in the interim. There
>>>> is a plan to release 3.7.1 in 2-3 days, I think; I can try to get
>>>> this fix into that release. Let me know if you can wait that long.
>>>> Another possibility is to compile just the glfsheal binary, which
>>>> "gluster volume heal <volname> info" uses internally, with the
>>>> fix. Let me know.
>>>>
>>>> Pranith.
>>>>>
>>>>>
>>>>>
>>>>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
>>>>> Copyright (C) 2013 Free Software Foundation, Inc.
>>>>> License GPLv3+: GNU GPL version 3 or later
>>>>> <http://gnu.org/licenses/gpl.html>
>>>>> This is free software: you are free to change and redistribute
>>>>> it.
>>>>> There is NO WARRANTY, to the extent permitted by law.  Type
>>>>> "show copying"
>>>>> and "show warranty" for details.
>>>>> This GDB was configured as "x86_64-redhat-linux-gnu".
>>>>> For bug reporting instructions, please see:
>>>>> <http://www.gnu.org/software/gdb/bugs/>...
>>>>> Reading symbols from /usr/sbin/glfsheal...Reading symbols
>>>>> from /usr/lib/debug/usr/sbin/glfsheal.debug...done.
>>>>> done.
>>>>> [New LWP 19430]
>>>>> [New LWP 19431]
>>>>> [New LWP 19434]
>>>>> [New LWP 19436]
>>>>> [New LWP 19433]
>>>>> [New LWP 19437]
>>>>> [New LWP 19432]
>>>>> [New LWP 19435]
>>>>> [Thread debugging using libthread_db enabled]
>>>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>>>> Core was generated by `/usr/sbin/glfsheal adsnet-vm-01'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0  inode_unref (inode=0x7f7a1e27806c) at inode.c:499
>>>>> 499             table = inode->table;
>>>>> (gdb) bt
>>>>> #0  inode_unref (inode=0x7f7a1e27806c) at inode.c:499
>>>>> #1  0x00007f7a265e8a61 in fini (this=<optimized out>) at
>>>>> qemu-block.c:1092
>>>>> #2  0x00007f7a39a53791 in xlator_fini_rec (xl=0x7f7a2000b9a0) at
>>>>> xlator.c:463
>>>>> #3  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a2000d450) at
>>>>> xlator.c:453
>>>>> #4  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a2000e800) at
>>>>> xlator.c:453
>>>>> #5  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a2000fbb0) at
>>>>> xlator.c:453
>>>>> #6  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20010f80) at
>>>>> xlator.c:453
>>>>> #7  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20012330) at
>>>>> xlator.c:453
>>>>> #8  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a200136e0) at
>>>>> xlator.c:453
>>>>> #9  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20014b30) at
>>>>> xlator.c:453
>>>>> #10 0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20015fc0) at
>>>>> xlator.c:453
>>>>> #11 0x00007f7a39a54eea in xlator_tree_fini (xl=<optimized out>)
>>>>> at xlator.c:545
>>>>> #12 0x00007f7a39a90b25 in glusterfs_graph_deactivate
>>>>> (graph=<optimized out>) at graph.c:340
>>>>> #13 0x00007f7a38d50e3c in pub_glfs_fini
>>>>> (fs=fs at entry=0x7f7a3a6b6010) at glfs.c:1155
>>>>> #14 0x00007f7a39f18ed4 in main (argc=<optimized out>,
>>>>> argv=<optimized out>) at glfs-heal.c:821
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Alessandro
>>>>>
>>>>>> On 29 May 2015, at 11:12, Pranith Kumar Karampuri
>>>>>> <pkarampu at redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/29/2015 02:37 PM, Alessandro De Salvo wrote:
>>>>>>
>>>>>>> Hi Pranith,
>>>>>>> many thanks for the help!
>>>>>>> The volume info of the problematic volume is the following:
>>>>>>>
>>>>>>>
>>>>>>> # gluster volume info adsnet-vm-01
>>>>>>>   
>>>>>>> Volume Name: adsnet-vm-01
>>>>>>> Type: Replicate
>>>>>>> Volume ID: f8f615df-3dde-4ea6-9bdb-29a1706e864c
>>>>>>> Status: Started
>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: gwads02.sta.adsnet.it:/gluster/vm01/data
>>>>>>> Brick2: gwads03.sta.adsnet.it:/gluster/vm01/data
>>>>>>> Options Reconfigured:
>>>>>>> nfs.disable: true
>>>>>>> features.barrier: disable
>>>>>>> features.file-snapshot: on
>>>>>>> server.allow-insecure: on
>>>>>> Are you sure the attached log is correct? I do not see any
>>>>>> backtrace in the log file to indicate there is a crash :-(.
>>>>>> Could you run "grep -i crash /var/log/glusterfs/*" to see if
>>>>>> some other file contains the crash? If that also fails, would
>>>>>> it be possible for you to provide the backtrace of the core
>>>>>> by opening it with gdb?
>>>>>>
>>>>>> Pranith
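
For reference, a minimal sketch of getting that backtrace, assuming
the core file was written as core.<pid> (its exact location depends on
the system's kernel.core_pattern setting):

    # look for a crash report in any of the gluster logs
    grep -i crash /var/log/glusterfs/*
    # open the core against the glfsheal binary and print the backtrace
    gdb /usr/sbin/glfsheal core.<pid>
    (gdb) bt
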
>>>>>>>
>>>>>>> The log is attached.
>>>>>>> I just wanted to add that the heal info command works fine
>>>>>>> on other volumes hosted by the same machines, so it’s just
>>>>>>> this volume that is causing problems.
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>> Alessandro
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 29 May 2015, at 10:50, Pranith Kumar
>>>>>>>> Karampuri <pkarampu at redhat.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/29/2015 02:18 PM, Pranith Kumar Karampuri wrote:
>>>>>>>>>
>>>>>>>>> On 05/29/2015 02:13 PM, Alessandro De Salvo wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> I'm facing a strange issue with split brain reporting.
>>>>>>>>>> I have upgraded to 3.7.0, after stopping all gluster
>>>>>>>>>> processes as described in the twiki, on all servers
>>>>>>>>>> hosting the volumes. The upgrade and the restart were
>>>>>>>>>> fine, and the volumes are accessible.
>>>>>>>>>> However, I had two files in split brain that I did not
>>>>>>>>>> heal before upgrading, so I tried a full heal with
>>>>>>>>>> 3.7.0. The heal was launched correctly, but when I now
>>>>>>>>>> perform a heal info there is no output, while the heal
>>>>>>>>>> statistics say there are actually 2 files in split
>>>>>>>>>> brain. In the logs I see something like this:
>>>>>>>>>>
>>>>>>>>>> glustershd.log:
>>>>>>>>>> [2015-05-29 08:28:43.008373] I
>>>>>>>>>> [afr-self-heal-entry.c:558:afr_selfheal_entry_do]
>>>>>>>>>> 0-adsnet-gluster-01-replicate-0: performing entry
>>>>>>>>>> selfheal on 7fd1262d-949b-402e-96c2-ae487c8d4e27
>>>>>>>>>> [2015-05-29 08:28:43.012690] W
>>>>>>>>>> [client-rpc-fops.c:241:client3_3_mknod_cbk]
>>>>>>>>>> 0-adsnet-gluster-01-client-1: remote operation failed:
>>>>>>>>>> Invalid argument. Path: (null)
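
A sketch of the heal commands being described, with <volname> as a
placeholder; "info split-brain" is a related view that should list
only the entries in split brain:

    # trigger a full self-heal crawl
    gluster volume heal <volname> full
    # list entries that still need healing
    gluster volume heal <volname> info
    # show crawl statistics, including the split-brain count
    gluster volume heal <volname> statistics
    # list only the entries in split brain
    gluster volume heal <volname> info split-brain
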
>>>>>>>>> Hey, could you share the "gluster volume info" output?
>>>>>>>>> Please also share the backtrace printed
>>>>>>>>> in /var/log/glusterfs/glfsheal-<volname>.log.
>>>>>>>> Please attach the /var/log/glusterfs/glfsheal-<volname>.log
>>>>>>>> file to this thread so that I can take a look.
>>>>>>>>
>>>>>>>> Pranith
>>>>>>>>> Pranith
>>>>>>>>>>
>>>>>>>>>> So, it seems like the files to be healed are not
>>>>>>>>>> correctly identified, or at least their path is null.
>>>>>>>>>> Also, every time I issue a "gluster volume heal
>>>>>>>>>> <volname> info" a core dump is generated in the log
>>>>>>>>>> area.
>>>>>>>>>> All servers are using the latest CentOS 7.
>>>>>>>>>> Any idea why this might be happening and how to solve
>>>>>>>>>> it?
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>>     Alessandro
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>
>



More information about the Gluster-users mailing list