[Gluster-users] [Gluster-devel] Query on healing process
Ravishankar N
ravishankar at redhat.com
Thu Mar 3 10:40:21 UTC 2016
Hi,
On 03/03/2016 11:14 AM, ABHISHEK PALIWAL wrote:
> Hi Ravi,
>
> As discussed earlier, I investigated this issue and found that
> healing is not triggered because the "gluster volume heal
> c_glusterfs info split-brain" command shows no entries in its
> output, even though the file is in split-brain.
A couple of observations from the 'commands_output' file:
getfattr -d -m . -e hex opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
The afr xattrs do not indicate that the file is in split brain:
# file: opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-1=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dd1d000ec7a9
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae
getfattr -d -m . -e hex opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-0=0x000000080000000000000000
trusted.afr.c_glusterfs-client-2=0x000000020000000000000000
trusted.afr.c_glusterfs-client-4=0x000000020000000000000000
trusted.afr.c_glusterfs-client-6=0x000000020000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dcb7000c87e7
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae
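As a side note, each trusted.afr.* value packs three big-endian 32-bit counters: pending data, metadata and entry operations, in that order. A minimal bash sketch decoding the client-0 value shown above (the value is hard-coded here purely for illustration):

```shell
#!/usr/bin/env bash
# Decode a trusted.afr.* xattr value: 12 bytes, i.e. three big-endian
# 32-bit counters for pending data, metadata and entry operations.
val=000000080000000000000000   # trusted.afr.c_glusterfs-client-0 above
data=$((16#${val:0:8}))        # pending data operations
meta=$((16#${val:8:8}))        # pending metadata operations
entry=$((16#${val:16:8}))      # pending entry operations
echo "data=$data metadata=$meta entry=$entry"   # data=8 metadata=0 entry=0
```

So the second brick blames the first with 8 pending data operations (an ordinary heal source/sink situation); a split-brain would need both bricks to blame each other.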
1. There doesn't seem to be a split-brain going by the trusted.afr* xattrs.
2. You seem to have re-used the bricks from another volume/setup. For
replica 2, only trusted.afr.c_glusterfs-client-0 and
trusted.afr.c_glusterfs-client-1 should be present, but I see four
xattrs: client-0, 2, 4 and 6.
3. On the rebooted node, do you have SSL enabled by any chance? There is
a bug for "Not able to fetch volfile" when SSL is enabled:
https://bugzilla.redhat.com/show_bug.cgi?id=1258931
Btw, for data and metadata split-brains you can use the gluster CLI
https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md
instead of modifying the files from the back end.
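For reference, the CLI-based resolution looks roughly like this; the volume name is from your setup, the hostname is a placeholder, and the file path is relative to the volume root (see the linked document for the exact syntax):

```shell
# List files that gluster itself considers to be in split-brain:
gluster volume heal c_glusterfs info split-brain

# Resolve a data/metadata split-brain by keeping the bigger file...
gluster volume heal c_glusterfs split-brain bigger-file \
    /logfiles/availability/CELLO_AVAILABILITY2_LOG.xml

# ...or by choosing one brick as the source copy for that file:
gluster volume heal c_glusterfs split-brain source-brick \
    <hostname>:/opt/lvmdir/c2/brick \
    /logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
```

These commands need a live cluster and real brick/host names, so they are shown for illustration only.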
-Ravi
>
> So, what I did was manually delete the gfid entry of that file
> from the .glusterfs directory and follow the instructions in the
> following link to heal:
>
> https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md
>
> and this works fine for me.
>
> But my question is why the split-brain command is not showing any
> file in its output.
>
> Here I am attaching all the logs I got from the node, and also
> the output of the commands from both boards.
>
> In this tar file two directories are present
>
> 000300 - log for the board which is running continuously
> 002500 - log for the board which is rebooted
>
> I am waiting for your reply; please help me out on this issue.
>
> Thanks in advance.
>
> Regards,
> Abhishek
>
> On Fri, Feb 26, 2016 at 1:21 PM, ABHISHEK PALIWAL
> <abhishpaliwal at gmail.com <mailto:abhishpaliwal at gmail.com>> wrote:
>
> On Fri, Feb 26, 2016 at 10:28 AM, Ravishankar N
> <ravishankar at redhat.com <mailto:ravishankar at redhat.com>> wrote:
>
> On 02/26/2016 10:10 AM, ABHISHEK PALIWAL wrote:
>>
>> Yes correct
>>
>
> Okay, so when you say the files are not in sync until some
> time, are you getting stale data when accessing from the mount?
> I'm not able to figure out why heal info shows zero when the
> files are not in sync, despite all IO happening from the
> mounts. Could you provide the output of getfattr -d -m . -e
> hex /brick/file-name from both bricks when you hit this issue?
>
> I'll provide the logs once I get them. Here, "delay" means we
> power on the second board after 10 minutes.
>
>
>> On Feb 26, 2016 9:57 AM, "Ravishankar N"
>> <ravishankar at redhat.com <mailto:ravishankar at redhat.com>> wrote:
>>
>> Hello,
>>
>> On 02/26/2016 08:29 AM, ABHISHEK PALIWAL wrote:
>>> Hi Ravi,
>>>
>>> Thanks for the response.
>>>
>>> We are using GlusterFS 3.7.8.
>>>
>>> Here is the use case:
>>>
>>> We have a logging file which saves logs of the events
>>> for every board of a node, and these files are kept in sync
>>> using glusterfs. The system is in replica-2 mode, which means
>>> that when one brick in a replicated volume goes offline, the
>>> glusterd daemons on the other nodes keep track of all
>>> the files that are not replicated to the offline brick.
>>> When the offline brick becomes available again, the
>>> cluster initiates a healing process, replicating the
>>> updated files to that brick. But in our case, we see
>>> that the log file of one board is not in sync and its
>>> format is corrupted, i.e. the files are not in sync.
>>
>> Just to understand you correctly: you have mounted the 2-node
>> replica-2 volume on both these nodes and are writing to
>> a logging file from the mounts, right?
>>
>>>
>>> Even the outcome of #gluster volume heal c_glusterfs
>>> info shows that there are no pending heals.
>>>
>>> Also, the logging file which is updated is of fixed
>>> size, and new entries wrap around, overwriting
>>> the old entries.
>>>
>>> This way we have seen that after a few restarts, the
>>> contents of the same file on the two bricks are different,
>>> but the volume heal info shows zero entries.
>>>
>>> Solution:
>>>
>>> But when we tried putting a delay of more than 5 minutes
>>> before the healing, everything works fine.
>>>
>>> Regards,
>>> Abhishek
>>>
>>> On Fri, Feb 26, 2016 at 6:35 AM, Ravishankar N
>>> <ravishankar at redhat.com <mailto:ravishankar at redhat.com>>
>>> wrote:
>>>
>>> On 02/25/2016 06:01 PM, ABHISHEK PALIWAL wrote:
>>>> Hi,
>>>>
>>>> Here, I have one query regarding the time taken by
>>>> the healing process.
>>>> In the current two-node setup, when we reboot one
>>>> node, the self-healing process starts within a
>>>> 5-minute interval on the board, which results in
>>>> corruption of some files' data.
>>>
>>> Heal should start immediately after the brick
>>> process comes up. What version of gluster are you
>>> using? What do you mean by corruption of data? Also,
>>> how did you observe that the heal started after 5
>>> minutes?
>>> -Ravi
>>>>
>>>> And to resolve it I searched on Google and found
>>>> the following link:
>>>> https://support.rackspace.com/how-to/glusterfs-troubleshooting/
>>>>
>>>> It mentions that the healing process can take up to
>>>> 10 minutes to start.
>>>>
>>>> Here is the statement from the link:
>>>>
>>>> "Healing replicated volumes
>>>>
>>>> When any brick in a replicated volume goes offline,
>>>> the glusterd daemons on the remaining nodes keep
>>>> track of all the files that are not replicated to
>>>> the offline brick. When the offline brick becomes
>>>> available again, the cluster initiates a healing
>>>> process, replicating the updated files to that
>>>> brick. *The start of this process can take up to 10
>>>> minutes, based on observation.*"
>>>>
>>>> After allowing more than 5 minutes, the file
>>>> corruption problem is resolved.
>>>>
>>>> So, my question is: is there any way through which
>>>> we can reduce the time the healing process takes
>>>> to start?
>>>>
>>>>
>>>> Regards,
>>>> Abhishek Paliwal
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> <mailto:Gluster-devel at gluster.org>
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>> Regards
>>> Abhishek Paliwal
>>
>>
>
>
>
>
>
> --
>
>
>
>
> Regards
> Abhishek Paliwal
>
>
>
>
> --
>
>
>
>
> Regards
> Abhishek Paliwal