[Gluster-users] Files won't heal, although no obvious problem visible
Ravishankar N
ravishankar at redhat.com
Wed Nov 23 12:22:54 UTC 2016
On 11/23/2016 04:56 PM, Pavel Cernohorsky wrote:
> Hello, thanks for your reply, answers are in the text.
>
> On 11/23/2016 11:55 AM, Ravishankar N wrote:
>> On 11/23/2016 03:56 PM, Pavel Cernohorsky wrote:
>>> The "hot-client-21" is, based on the vol-file, the following of the
>>> bricks:
>>> option remote-subvolume /opt/data/hdd5/gluster
>>> option remote-host 10.10.27.11
>>>
>>> I have the self-heal daemon disabled, but when I try to trigger
>>> healing manually (gluster volume heal <volname>), I get: "Launching
>>> heal operation to perform index self heal on volume <volname> has
>>> been unsuccessful on bricks that are down. Please check if all brick
>>> processes are running.", although all the bricks are online (gluster
>>> volume status <volname>).
>>
>> Can you enable the self-heal daemon and try again? `gluster volume
>> heal <volname>` requires the shd to be enabled. The error message
>> that you get is misleading and is being fixed.
>
> When I enabled the self-heal daemon, I was able to start healing, and
> the files were actually healed. What does the self-heal daemon do in
> addition to the automated healing when you read the file?
The lookup/read code-path doesn't seem to consider a file with only
the afr.dirty xattr being non-zero as a candidate for heal (while the
self-heal-daemon code-path does). I'm not sure at this point whether it
should, because afr.dirty being set on all bricks without any
trusted.afr.xxx-client-xxx being set is not something that should be
hit under normal circumstances. I'll need to think about this more.
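If you still have a file in that state, or hit it again, the xattrs on
the backend bricks will show it. Something like the following, run as
root on each brick host against the file's path inside that brick (the
path below is only an example built from your vol-file snippet), dumps
the relevant xattrs in hex:

  getfattr -d -m . -e hex \
      /opt/data/hdd5/gluster/path/to/99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4

In the state described above you would see trusted.afr.dirty with a
non-zero value while the trusted.afr.xxx-client-xxx xattrs are absent
or all-zero on every brick.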
>
> The original reason to disable the self-heal daemon was to be able to
> control the amount of resources used by healing, because
> "cluster.background-self-heal-count: 1" did not help very much and the
> amount of both network and disk I/O consumed was just extreme.
>
> And I am also pretty sure we saw a similar problem (not sure about
> the attributes) before we disabled the shd.
>
>>
>>>
>>> When I try to just md5sum the file, to trigger the automated healing
>>> on file access, I get a result, but the file is still not healed.
>>> This usually works when I do not get 3 entries for the same file in
>>> the heal info.
>>
>> Is the file size for 99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4
>> non-zero on the 2 data bricks (i.e. on 10.10.27.11 and 10.10.27.10)
>> and do they match?
>> Do the md5sums calculated directly on these bricks match what you
>> got on the mount?
>
> The file has a non-zero size on both data bricks, and the md5sum was
> the same on both of them before they were healed; after the healing
> (enabling the shd and starting the heal) the md5sum did not change on
> either of the data bricks. The mount point reports the same md5sum as
> all the attempts directly on the bricks. So what is actually happening
> there? Why was the file blamed (and not unblamed after healing)?
That means there was no real heal pending. But because the dirty xattr
was set, the shd picked a brick as the source and did the heal anyway.
We would need to find out how we ended up in the "only afr.dirty xattr
is set" state for the file.
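Just to spell out the check (the mount point and the file's path inside
each brick below are only placeholders; substitute your real paths):

  # on the fuse mount
  md5sum /mnt/<volname>/path/to/99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4

  # on each of the two data bricks (10.10.27.11 and 10.10.27.10)
  md5sum /opt/data/hdd5/gluster/path/to/99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4

If all three md5sums agree, the replicas were already consistent, which
matches the "no real heal pending" conclusion above.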
-Ravi
>
> Thanks for your answers,
> Pavel
>