[Gluster-users] Files won't heal, although no obvious problem visible

Wed Nov 23 12:22:54 UTC 2016

On 11/23/2016 04:56 PM, Pavel Cernohorsky wrote:
> Hello, thanks for your reply, answers are in the text.
>
> On 11/23/2016 11:55 AM, Ravishankar N wrote:
>> On 11/23/2016 03:56 PM, Pavel Cernohorsky wrote:
>>> Based on the vol-file, "hot-client-21" corresponds to the following 
>>> brick:
>>> option remote-subvolume /opt/data/hdd5/gluster
>>> option remote-host 10.10.27.11
>>>
>>> I have the self-heal daemon disabled, but when I try to trigger 
>>> healing manually (gluster volume heal <volname>), I get: "Launching 
>>> heal operation to perform index self heal on volume <volname> has 
>>> been unsuccessful on bricks that are down. Please check if all brick 
>>> processes are running.", although all the bricks are online 
>>> according to gluster volume status <volname>.
>>
>> Can you enable the self-heal daemon and try again? `gluster volume 
>> heal <volname>` requires the shd to be enabled. The error message 
>> that you get is inappropriate and is being fixed.
>
> When I enabled the self-heal daemon, I was able to start healing, and 
> the files were actually healed. What does the self-heal daemon do 
> beyond the automated healing that happens when you read the file?

The lookup/read code-path does not seem to treat a file on which only 
the afr.dirty xattr is non-zero as a candidate for heal (while the 
self-heal daemon code-path does). I'm not sure at this point whether it 
should, because afr.dirty being set on all bricks without any 
trusted.afr.xxx-client-xxx xattr being set is not something that should 
be hit under normal circumstances. I'll need to think about this more.
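
To check a file for that state, something like the following could be 
run against the output of `getfattr -d -m . -e hex <path-on-brick>` on 
each brick. This is only a rough sketch, not what AFR itself does: the 
helper names are made up for illustration, and it assumes the usual 
`name=0x...` hex dump format and that any trusted.afr.* xattr other 
than trusted.afr.dirty is a pending (blame) xattr.

```python
import re

AFR_PREFIX = "trusted.afr."
DIRTY_KEY = "trusted.afr.dirty"

# Matches lines such as: trusted.afr.dirty=0x000000010000000000000000
XATTR_LINE = re.compile(r"^([\w.\-]+)=0x((?:[0-9a-fA-F]{2})*)$")


def parse_getfattr_hex(text):
    """Parse `name=0x...` lines of a hex xattr dump into {name: bytes}."""
    xattrs = {}
    for line in text.splitlines():
        match = XATTR_LINE.match(line.strip())
        if match:
            xattrs[match.group(1)] = bytes.fromhex(match.group(2))
    return xattrs


def is_nonzero(value):
    """True if any byte of the xattr value is non-zero."""
    return any(value)


def only_dirty_set(xattrs):
    """True if afr.dirty is non-zero and no other trusted.afr.* xattr is."""
    dirty = is_nonzero(xattrs.get(DIRTY_KEY, b""))
    pending = any(
        is_nonzero(value)
        for name, value in xattrs.items()
        # Assumption: every other trusted.afr.* key is a pending xattr.
        if name.startswith(AFR_PREFIX) and name != DIRTY_KEY
    )
    return dirty and not pending
```

If `only_dirty_set(parse_getfattr_hex(dump))` is true for the dump from 
every brick, the file is in the "only afr.dirty set" state described 
above.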

>
> The original reason for disabling the self-heal daemon was to be able 
> to control the amount of resources used by healing, because 
> "cluster.background-self-heal-count: 1" did not help very much and the 
> amount of both network and disk I/O consumed was just extreme.
>
> I am also pretty sure we saw a similar problem (not sure about the 
> attributes) before we disabled the shd.
>
>>
>>>
>>> When I try to just md5sum the file, to trigger automated healing on 
>>> file manipulation, I get the result, but the file is not healed 
>>> anyway. This usually works when I do not get 3 entries for the same 
>>> file in the heal info.
>>
>> Is the file size for 99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4 
>> non-zero on the 2 data bricks (i.e. on 10.10.27.11 and 10.10.27.10) 
>> and do they match?
>> Do the md5sums match with what you got on the mount when you 
>> calculate it directly on these bricks?
>
> The file has non-zero size on both data bricks, and the md5 sum was 
> the same on both of them before they were healed. After the healing 
> (enabling the shd and starting the heal), the md5 did not change on 
> either of the data bricks. The mount point reports the same md5 as 
> all the other attempts directly on the bricks. So what is actually 
> happening there? Why was the file blamed (and not unblamed after 
> healing)?

That means there was no real heal pending. But because the dirty xattr 
was set, the shd picked a brick as the source and did the heal anyway. 
We would need to find out how we ended up in the 'only afr.dirty xattr 
was set' state for the file.
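
For reference, the brick-side comparison discussed above can be 
scripted roughly as follows. This is a minimal sketch with hypothetical 
helper names; it assumes the copies of the file on the data bricks (and 
on the mount, if wanted) are readable as ordinary local paths.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """True if every path in `paths` has the same MD5 digest."""
    return len({md5_of(path) for path in paths}) == 1
```

If `copies_match` returns True for the copies on both data bricks and 
the copy seen through the mount, the data is identical everywhere.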

-Ravi
>
> Thanks for your answers,
> Pavel
>