[Gluster-users] Files won't heal, although no obvious problem visible
pavel.cernohorsky at appeartv.com
Wed Nov 23 12:40:52 UTC 2016
I'm afraid I do not know how we got into this strange state; I do not know
Gluster in enough detail. When does the trusted.afr.dirty flag get set,
and when does the trusted.afr.xxx-client-xxx flag get set? From what you
are saying, it seems you expect them to always be set / cleared at the
same moment.
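For reference, both flags live as extended attributes on the brick-side copies of a file and can be dumped with getfattr, run as root on the brick host. A sketch, using the brick path and file name discussed in this thread:

```shell
# Dump all trusted.* xattrs (hex-encoded) for one file, on a brick host, as root.
# Brick path and file name are the ones from this thread.
getfattr -d -m . -e hex \
    /opt/data/hdd5/gluster/99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4
# Keys of interest: trusted.afr.dirty (non-zero while an operation is in
# flight) and trusted.afr.hot-client-N (non-zero when this brick blames a
# peer brick for a pending heal).
```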
If it helps, you can find the full volume info at the end of my message.
Is there anything more I can do to help discover what happened or fix
the problem?
Thanks for your help, kind regards,
Volume Name: hot
Volume ID: 4d09dd56-97b6-4b63-8765-0a08574e8ddd
Snapshot Count: 0
Number of Bricks: 12 x (2 + 1) = 36
Brick3: 10.10.27.11:/opt/data/ssd/arbiter1 (arbiter)
... similar triplets here ...
Brick36: 10.10.27.10:/opt/data/ssd/arbiter12 (arbiter)
On 11/23/2016 01:22 PM, Ravishankar N wrote:
> On 11/23/2016 04:56 PM, Pavel Cernohorsky wrote:
>> Hello, thanks for your reply, answers are in the text.
>> On 11/23/2016 11:55 AM, Ravishankar N wrote:
>>> On 11/23/2016 03:56 PM, Pavel Cernohorsky wrote:
>>>> Based on the vol-file, "hot-client-21" corresponds to the following:
>>>> option remote-subvolume /opt/data/hdd5/gluster
>>>> option remote-host 10.10.27.11
>>>> I have self healing daemon disabled, but when I try to trigger
>>>> healing manually (gluster volume heal <volname>), I get: "Launching
>>>> heal operation to perform index self heal on volume <volname> has
>>>> been unsuccessful on bricks that are down. Please check if all
>>>> brick processes are running.", although all the bricks are online
>>>> (gluster volume status <volname>).
>>> Can you enable the self-heal daemon and try again? `gluster
>>> volume heal <volname>` requires the shd to be enabled. The error
>>> message you get is misleading and is being fixed.
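Concretely, assuming the volume name from this thread, that sequence would be:

```shell
# Re-enable the self-heal daemon and retrigger the index heal
# ("hot" is the volume name from this thread).
gluster volume set hot cluster.self-heal-daemon on
gluster volume heal hot
gluster volume heal hot info    # entries should drain as heals complete
```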
>> When I enabled the self-heal daemon, I was able to start healing, and
>> the files were actually healed. What does the self-heal daemon do in
>> addition to the automatic healing triggered when you read a file?
> The lookup/read code-path does not seem to consider a file with
> only the afr.dirty xattr non-zero as a candidate for heal (while
> the self-heal daemon code-path does). I'm not sure at this point
> whether it should, because afr.dirty being set on all bricks without
> any trusted.afr.xxx-client-xxx being set is not something that
> should happen under normal circumstances. I'll need to think about
> this more.
>> The original reason for disabling the self-heal daemon was to control
>> the amount of resources used by healing, because
>> "cluster.background-self-heal-count: 1" did not help very much and
>> the amount of both network and disk IO consumed was just extreme.
>> And I am also pretty sure we saw a similar problem (not sure
>> about the attributes) before we disabled the shd.
>>>> When I try to just md5sum the file, to trigger automated healing on
>>>> file manipulation, I get the result, but the file is not healed
>>>> anyway. This usually works when I do not get 3 entries for the same
>>>> file in the heal info.
>>> Is the file size for 99705_544c0cd369a84ebcaf095b4a9f6d682a.mp4
>>> non-zero on the 2 data bricks (i.e. on 10.10.27.11 and 10.10.27.10)
>>> and do they match?
>>> Do the md5sums match with what you got on the mount when you
>>> calculate it directly on these bricks?
>> The file has a non-zero size on both data bricks, and the md5 sum
>> was the same on both of them before they were healed; after the
>> heal (enabling the shd and starting the heal) the md5 did not change
>> on either of the data bricks. The mount point reports the same md5 as
>> all the other attempts directly on the bricks. So what is actually
>> happening there? Why was the file blamed, and why was it not unblamed
>> after the heal?
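The check described above amounts to comparing checksums of the two data-brick copies (the arbiter brick stores no file data, so it is skipped). A minimal local sketch of that comparison, with temp files standing in for the brick copies; on a real cluster the paths would be the same relative path under each data brick, e.g. under /opt/data/hdd5/gluster on 10.10.27.11 and 10.10.27.10:

```shell
# Stand-ins for the two data-brick copies of one file.
copy_a=$(mktemp)
copy_b=$(mktemp)
printf 'same payload' > "$copy_a"
printf 'same payload' > "$copy_b"

# Checksum each copy; on a real cluster, run md5sum on each brick host.
sum_a=$(md5sum "$copy_a" | cut -d' ' -f1)
sum_b=$(md5sum "$copy_b" | cut -d' ' -f1)

if [ "$sum_a" = "$sum_b" ]; then
    echo "replicas match: $sum_a"
else
    echo "replicas differ: $sum_a vs $sum_b"
fi
rm -f "$copy_a" "$copy_b"
```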
> That means there was no real heal pending. But because the dirty xattr
> was set, the shd picked a brick as a source and did the heal
> anyway. We would need to find out how we ended up in the 'only the
> afr.dirty xattr was set' state for this file.
>> Thanks for your answers,