[Gluster-users] Volume heal info not reporting files in split brain and core dumping, after upgrading to 3.7.0

Pranith Kumar Karampuri pkarampu at redhat.com
Fri May 29 10:33:40 UTC 2015



On 05/29/2015 03:36 PM, Alessandro De Salvo wrote:
> Hi Pranith,
> thank you! 2-3 days are fine, don’t worry. However, if you can give 
> me the details of how to compile the glfsheal binary you mentioned, 
> we could quickly check whether everything’s fine with the fix before 
> you release. So just let me know what you prefer. Waiting 2-3 days is 
> not a problem for me though, as it is not a critical server and I 
> could even recreate the volumes.

We recently introduced a code path that frees up memory in long-standing 
processes. It seems this path was not tested with the file-snapshot 
feature enabled; if that option is disabled, the crash won't happen. 
"gluster volume heal <volname> info" uses the same API, but fortunately 
the "glfsheal" process exits as soon as the heal info output has been 
gathered, so there is no need to call this memory-freeing code just 
before it dies. For now we have enabled this code path (patch: 
http://review.gluster.org/11001) only in internal builds, not in 
released versions, while we stabilize that part of the code. You can 
take this patch for patching glfsheal.
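
As an interim workaround until the release, turning the option off 
should avoid the crash. A minimal sketch, using the volume name from 
your mail (do verify that disabling file snapshots is acceptable for 
your setup before changing a production volume):

    # Disable the feature whose cleanup path triggers the crash,
    # then re-run the command that was dumping core.
    gluster volume set adsnet-vm-01 features.file-snapshot off
    gluster volume heal adsnet-vm-01 info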

Pranith
> Thanks again,
>
> Alessandro
>
>> On 29 May 2015, at 11:54, Pranith Kumar Karampuri 
>> <pkarampu at redhat.com> wrote:
>>
>>
>>
>> On 05/29/2015 03:16 PM, Alessandro De Salvo wrote:
>>> Hi Pranith,
>>> I’m definitely sure the log is correct, and you are also right that 
>>> there is no sign of a crash in it (even checking with grep!).
>>> However, I see core dumps (e.g. core.19430) created in 
>>> /var/log/gluster every time I issue the heal info command.
>>> From gdb I see this:
>> Thanks for providing the information, Alessandro. We will fix this 
>> issue; in the meantime I am wondering how we can unblock you. There 
>> is a plan to release 3.7.1 in 2-3 days, I think, and I can try to get 
>> this fix into that release. Let me know if you can wait that long. 
>> Another possibility is to compile just the glfsheal binary with the 
>> fix, which "gluster volume heal <volname> info" uses internally. Let 
>> me know.
>>
>> Pranith.
>>>
>>>
>>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
>>> Copyright (C) 2013 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later 
>>> <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show 
>>> copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-redhat-linux-gnu".
>>> For bug reporting instructions, please see:
>>> <http://www.gnu.org/software/gdb/bugs/>...
>>> Reading symbols from /usr/sbin/glfsheal...Reading symbols from 
>>> /usr/lib/debug/usr/sbin/glfsheal.debug...done.
>>> done.
>>> [New LWP 19430]
>>> [New LWP 19431]
>>> [New LWP 19434]
>>> [New LWP 19436]
>>> [New LWP 19433]
>>> [New LWP 19437]
>>> [New LWP 19432]
>>> [New LWP 19435]
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>> Core was generated by `/usr/sbin/glfsheal adsnet-vm-01'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  inode_unref (inode=0x7f7a1e27806c) at inode.c:499
>>> 499             table = inode->table;
>>> (gdb) bt
>>> #0  inode_unref (inode=0x7f7a1e27806c) at inode.c:499
>>> #1  0x00007f7a265e8a61 in fini (this=<optimized out>) at 
>>> qemu-block.c:1092
>>> #2  0x00007f7a39a53791 in xlator_fini_rec (xl=0x7f7a2000b9a0) at 
>>> xlator.c:463
>>> #3  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a2000d450) at 
>>> xlator.c:453
>>> #4  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a2000e800) at 
>>> xlator.c:453
>>> #5  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a2000fbb0) at 
>>> xlator.c:453
>>> #6  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20010f80) at 
>>> xlator.c:453
>>> #7  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20012330) at 
>>> xlator.c:453
>>> #8  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a200136e0) at 
>>> xlator.c:453
>>> #9  0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20014b30) at 
>>> xlator.c:453
>>> #10 0x00007f7a39a53725 in xlator_fini_rec (xl=0x7f7a20015fc0) at 
>>> xlator.c:453
>>> #11 0x00007f7a39a54eea in xlator_tree_fini (xl=<optimized out>) at 
>>> xlator.c:545
>>> #12 0x00007f7a39a90b25 in glusterfs_graph_deactivate 
>>> (graph=<optimized out>) at graph.c:340
>>> #13 0x00007f7a38d50e3c in pub_glfs_fini (fs=fs@entry=0x7f7a3a6b6010) 
>>> at glfs.c:1155
>>> #14 0x00007f7a39f18ed4 in main (argc=<optimized out>, 
>>> argv=<optimized out>) at glfs-heal.c:821
>>>
>>>
>>> Thanks,
>>>
>>> Alessandro
>>>
>>>> On 29 May 2015, at 11:12, Pranith Kumar Karampuri 
>>>> <pkarampu at redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 05/29/2015 02:37 PM, Alessandro De Salvo wrote:
>>>>> Hi Pranith,
>>>>> many thanks for the help!
>>>>> The volume info of the problematic volume is the following:
>>>>>
>>>>> # gluster volume info adsnet-vm-01
>>>>> Volume Name: adsnet-vm-01
>>>>> Type: Replicate
>>>>> Volume ID: f8f615df-3dde-4ea6-9bdb-29a1706e864c
>>>>> Status: Started
>>>>> Number of Bricks: 1 x 2 = 2
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: gwads02.sta.adsnet.it:/gluster/vm01/data
>>>>> Brick2: gwads03.sta.adsnet.it:/gluster/vm01/data
>>>>> Options Reconfigured:
>>>>> nfs.disable: true
>>>>> features.barrier: disable
>>>>> features.file-snapshot: on
>>>>> server.allow-insecure: on
>>>> Are you sure the attached log is correct? I do not see any 
>>>> backtrace in the log file to indicate there is a crash :-(. Could 
>>>> you run "grep -i crash /var/log/glusterfs/*" to see whether some 
>>>> other file contains the crash? If that also fails, would it be 
>>>> possible for you to provide the backtrace of the core by opening it 
>>>> with gdb?
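>>>>
>>>> For example (the core file path is illustrative, and you may need 
>>>> the glusterfs debuginfo packages for readable symbols):
>>>>
>>>>     grep -i crash /var/log/glusterfs/*
>>>>     # if that finds nothing, open the core dump directly:
>>>>     gdb /usr/sbin/glfsheal /path/to/core.<pid>
>>>>     (gdb) bt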
>>>>
>>>> Pranith
>>>>>
>>>>> The log is in attachment.
>>>>> I just wanted to add that the heal info command works fine on 
>>>>> other volumes hosted by the same machines, so it’s just this 
>>>>> volume which is causing problems.
>>>>> Thanks,
>>>>>
>>>>> Alessandro
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 29 May 2015, at 10:50, Pranith Kumar Karampuri 
>>>>>> <pkarampu at redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/29/2015 02:18 PM, Pranith Kumar Karampuri wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/29/2015 02:13 PM, Alessandro De Salvo wrote:
>>>>>>>> Hi,
>>>>>>>> I'm facing a strange issue with split-brain reporting.
>>>>>>>> I upgraded to 3.7.0, after stopping all gluster processes as 
>>>>>>>> described in the twiki, on all servers hosting the volumes. The 
>>>>>>>> upgrade and the restart were fine, and the volumes are 
>>>>>>>> accessible.
>>>>>>>> However, I had two files in split brain that I did not heal 
>>>>>>>> before upgrading, so I tried a full heal with 3.7.0. The heal 
>>>>>>>> was launched correctly, but when I now run a heal info there is 
>>>>>>>> no output, while the heal statistics say there are actually 2 
>>>>>>>> files in split brain. In the logs I see something like this:
>>>>>>>>
>>>>>>>> glustershd.log:
>>>>>>>> [2015-05-29 08:28:43.008373] I 
>>>>>>>> [afr-self-heal-entry.c:558:afr_selfheal_entry_do] 
>>>>>>>> 0-adsnet-gluster-01-replicate-0: performing entry selfheal on 
>>>>>>>> 7fd1262d-949b-402e-96c2-ae487c8d4e27
>>>>>>>> [2015-05-29 08:28:43.012690] W 
>>>>>>>> [client-rpc-fops.c:241:client3_3_mknod_cbk] 
>>>>>>>> 0-adsnet-gluster-01-client-1: remote operation failed: Invalid 
>>>>>>>> argument. Path: (null)
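>>>>>>>>
>>>>>>>> For completeness, the sequence of commands was roughly this 
>>>>>>>> (volume name omitted):
>>>>>>>>
>>>>>>>>     gluster volume heal <volname> full
>>>>>>>>     gluster volume heal <volname> info
>>>>>>>>     gluster volume heal <volname> statistics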
>>>>>>> Hey, could you let us know the "gluster volume info" output? 
>>>>>>> Please also let us know the backtrace printed in 
>>>>>>> /var/log/glusterfs/glfsheal-<volname>.log.
>>>>>> Please attach the /var/log/glusterfs/glfsheal-<volname>.log file 
>>>>>> to this thread so that I can take a look.
>>>>>>
>>>>>> Pranith
>>>>>>>
>>>>>>> Pranith
>>>>>>>>
>>>>>>>>
>>>>>>>> So, it seems like the files to be healed are not correctly 
>>>>>>>> identified, or at least their path is null.
>>>>>>>> Also, every time I issue "gluster volume heal <volname> info", 
>>>>>>>> a core dump is generated in the log area.
>>>>>>>> All servers are using the latest CentOS 7.
>>>>>>>> Any idea why this might be happening and how to solve it?
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>    Alessandro
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>>
>>
>
