[Gluster-users] Freezing during heal

Lindsay Mathieson lindsay.mathieson at gmail.com
Mon Apr 25 12:50:53 UTC 2016


Good luck!

On 25/04/2016 10:01 PM, Kevin Lemonnier wrote:
> Hi,
>
> So I'm trying that now.
> I installed 3.7.11 on two nodes and put a few VMs on it, same config
> as before but with 64MB shards and the heal algorithm set to full. As
> expected, if I power off one of the nodes, everything is dead, which is fine.
>
> Now I'm adding a third node; after the add-brick, a big heal of everything
> (7000+ shards) started, and so far everything seems to be working fine on
> the VMs. Last time I tried adding a brick, all those VMs died for the
> duration of the heal, so that's already pretty good.
>
> I'm going to let it finish copying everything to the new node, then I'll
> try to simulate nodes going down to see if this config solves my original
> problem of freezing and long heal times.
> For reference, here is the volume info, in case someone sees something I
> should change:
>
> Volume Name: gluster
> Type: Replicate
> Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: ipvr2.client_name:/mnt/storage/gluster
> Brick2: ipvr3.client_name:/mnt/storage/gluster
> Brick3: ipvr50.client_name:/mnt/storage/gluster
> Options Reconfigured:
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> features.shard: on
> features.shard-block-size: 64MB
> cluster.data-self-heal-algorithm: full
> performance.readdir-ahead: on
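For anyone setting this up by hand, the reconfigured options above can be replayed with `gluster volume set`; a minimal sketch follows (a dry run that only builds and prints the commands, assuming the volume is named `gluster` as above):

```shell
# Build the "gluster volume set" commands for the options listed above.
# Dry run: we only collect and print them; pipe the output to sh on a
# node with the gluster CLI to actually apply them.
VOL=gluster
cmds=""
while read -r opt val; do
    cmds="${cmds}gluster volume set $VOL $opt $val
"
done <<'EOF'
cluster.quorum-type auto
cluster.server-quorum-type server
network.remote-dio enable
cluster.eager-lock enable
performance.quick-read off
performance.read-ahead off
performance.io-cache off
performance.stat-prefetch off
features.shard on
features.shard-block-size 64MB
cluster.data-self-heal-algorithm full
performance.readdir-ahead on
EOF
printf '%s' "$cmds"
```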
>
>
> It starts at 2 and jumps to 50 because the first server is doing something
> else for now, and I'm using 50 as the temporary third node. If everything
> goes well, I'll migrate the production to the cluster, re-install the first
> server and do a replace-brick, which I hope will work just as well as the
> add-brick I'm doing now. The last replace-brick also brought everything
> down, but I guess that was the joy of 3.7.6 :).
>
> Thanks !
>
>
> On Mon, Apr 18, 2016 at 08:17:05PM +0530, Krutika Dhananjay wrote:
>> On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
>> wrote:
>>
>>> I will try migrating to 3.7.10, is it considered stable yet?
>>>
>> Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
>>
>>
>>> Should I change the self heal algorithm even if I move to 3.7.10, or is
>>> that not necessary ?
>>> Not sure what that change might do.
>>>
>> So the other algorithm is 'diff', which computes rolling checksums on
>> chunks of the source(s) and sink(s), compares them, and heals only upon
>> mismatch. This is known to consume a lot of CPU. The 'full' algorithm, on
>> the other hand, simply copies the source into the sink in chunks. With
>> sharding, it shouldn't be all that bad copying a 256MB file (in your case)
>> from source to sink. We've used double that block size and had no issues
>> reported.
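Krutika's comparison of the two algorithms can be sketched with toy functions (hypothetical names and a toy 4-byte chunk size; the real self-heal operates on brick files and much larger chunks):

```python
import hashlib

CHUNK = 4  # toy chunk size; real heals use far larger chunks

def heal_diff(src: bytes, sink: bytearray) -> int:
    """'diff' style: checksum every chunk on both sides, copy only
    the mismatching chunks. Returns the number of bytes copied."""
    copied = 0
    for off in range(0, len(src), CHUNK):
        chunk = src[off:off + CHUNK]
        if hashlib.md5(chunk).digest() != hashlib.md5(bytes(sink[off:off + CHUNK])).digest():
            sink[off:off + CHUNK] = chunk
            copied += len(chunk)
    return copied

def heal_full(src: bytes, sink: bytearray) -> int:
    """'full' style: no checksumming, copy everything."""
    sink[:] = src
    return len(src)

src = b"abcdefgh"
damaged = bytearray(b"abcdXXgh")           # one bad chunk out of two
print(heal_diff(src, bytearray(damaged)))  # 4: only the bad chunk is copied
print(heal_full(src, bytearray(damaged)))  # 8: everything is copied
```

With 64MB shards, 'full' simply recopies each dirty shard, trading some extra network traffic for the CPU the checksumming would have burned.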
>>
>> So you could change self heal algo to full even in the upgraded cluster.
>>
>> -Krutika
>>
>>
>>> Anyway, I'll try to create a 3.7.10 cluster over the weekend and slowly
>>> move the VMs to it then.
>>> Thanks a lot for your help,
>>>
>>> Regards
>>>
>>>
>>> On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay wrote:
>>>> Hi,
>>>>
>>>> Yeah, so the fuse mount log didn't convey much information.
>>>>
>>>> So one of the reasons the heal may have taken so long (and also
>>>> consumed so many resources) is a bug in self-heal where it would heal
>>>> from both source bricks in 3-way replication. With such a bug, the heal
>>>> takes twice as long and consumes the same amount of resources both
>>>> times.
>>>>
>>>> This issue is fixed at http://review.gluster.org/#/c/14008/ and will be
>>>> available in 3.7.12.
>>>>
>>>> The other thing you could do is to set cluster.data-self-heal-algorithm
>>>> to 'full', for better heal performance and more regulated resource
>>>> consumption:
>>>>   # gluster volume set <VOL> cluster.data-self-heal-algorithm full
>>>>
>>>> As far as sharding is concerned, some critical caching issues were
>>>> fixed in 3.7.7 and 3.7.8, and my guess is that the VM crash/unbootable
>>>> state could be because of this issue, which exists in 3.7.6.
>>>>
>>>> 3.7.10 introduced throttled client-side heals, which also move such
>>>> heals to the background; this is all the more helpful for preventing
>>>> starvation of VMs during client heals.
>>>>
>>>> Considering these factors, I think it would be better if you upgraded
>>>> your machines to 3.7.10.
>>>>
>>>> Do let me know if migrating to 3.7.10 solves your issues.
>>>>
>>>> -Krutika
>>>>
>>>> On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier <lemonnierk at ulrar.net>
>>>> wrote:
>>>>
>>>>> Yes, but as I was saying, I don't believe KVM is using a mount point;
>>>>> I think it uses the API (
>>>>> http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt
>>>>> ). I might be mistaken, of course. Proxmox does have a mount point for
>>>>> convenience; I'll attach those logs, hoping they contain the
>>>>> information you need. They do seem to contain a lot of errors for the
>>>>> 15th.
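For context, the libgfapi route the link above describes bypasses FUSE entirely: qemu is pointed at a gluster:// URL instead of a file on a mount point. A sketch, with hypothetical image paths (the host and volume names are taken from the volume info earlier in the thread):

```shell
# FUSE route: qemu reads the image through the mounted volume,
# so client-side errors show up in the FUSE mount log.
qemu-system-x86_64 -drive file=/mnt/pve/gluster/images/vm1.qcow2,if=virtio

# libgfapi route: qemu talks to the volume ("gluster" here) directly
# over the network, no mount point involved.
qemu-system-x86_64 -drive file=gluster://ipvr2.client_name/gluster/images/vm1.qcow2,if=virtio
```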
>>>>> For reference, there was a disconnect of the first brick (10.10.0.1)
>>>>> in the morning and then a successful heal that caused about 40 minutes
>>>>> of downtime for the VMs. Right after that heal finished (if my memory
>>>>> is correct, it was about noon or close), the second brick (10.10.0.2)
>>>>> rebooted, and that's the one I disconnected to prevent the heal from
>>>>> causing another downtime.
>>>>> I reconnected it at the end of the afternoon, hoping the heal would go
>>>>> well, but everything went down like in the morning, so I disconnected
>>>>> it again and waited until 11pm (23:00) to reconnect it and let it
>>>>> finish.
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>>
>>>>> On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:
>>>>>> Sorry, I was referring to the glusterfs client logs.
>>>>>>
>>>>>> Assuming you are using a FUSE mount, your log file will be in
>>>>>> /var/log/glusterfs/<hyphenated-mount-point-path>.log
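The hyphenation rule is mechanical; a tiny helper (hypothetical, just to illustrate the naming) that maps a mount point to its expected log file:

```python
def fuse_log_path(mount_point: str) -> str:
    """GlusterFS names the FUSE client log after the mount point,
    with the leading '/' dropped and remaining '/' turned into '-'."""
    return "/var/log/glusterfs/%s.log" % mount_point.strip("/").replace("/", "-")

print(fuse_log_path("/mnt/pve/gluster"))  # /var/log/glusterfs/mnt-pve-gluster.log
```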
>>>>>>
>>>>>> -Krutika
>>>>>>
>>>>>> On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier <
>>> lemonnierk at ulrar.net>
>>>>>> wrote:
>>>>>>
>>>>>>> I believe Proxmox is just an interface to KVM that uses the lib, so
>>>>>>> if I'm not mistaken there aren't any client logs?
>>>>>>>
>>>>>>> It's not the first time I've had the issue; it happens on every heal
>>>>>>> on the 2 clusters I have.
>>>>>>>
>>>>>>> I did let the heal finish that night and the VMs are working now,
>>>>>>> but it is pretty scary for future crashes or brick replacements.
>>>>>>> Should I maybe lower the shard size? It won't solve the fact that 2
>>>>>>> bricks out of 3 aren't keeping the filesystem usable, but it might
>>>>>>> make the healing quicker, right?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Le 17 avril 2016 17:56:37 GMT+02:00, Krutika Dhananjay <
>>>>>>> kdhananj at redhat.com> a écrit :
>>>>>>>> Could you share the client logs and information about the approx
>>>>>>>> time/day
>>>>>>>> when you saw this issue?
>>>>>>>>
>>>>>>>> -Krutika
>>>>>>>>
>>>>>>>> On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier
>>>>>>>> <lemonnierk at ulrar.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We have a small GlusterFS 3.7.6 cluster with 3 nodes running
>>>>>>>>> Proxmox VMs on it. I did set up the different recommended options,
>>>>>>>>> like the virt group, but by hand since it's on Debian. The shards
>>>>>>>>> are 256MB, if that matters.
>>>>>>>>> This morning the second node crashed, and as it came back up it
>>>>>>>>> started a heal, but that basically froze all the VMs running on
>>>>>>>>> that volume. Since we really, really can't have 40 minutes of
>>>>>>>>> downtime in the middle of the day, I just removed the node from
>>>>>>>>> the network, and that stopped the heal, allowing the VMs to access
>>>>>>>>> their disks again. The plan was to re-connect the node in a couple
>>>>>>>>> of hours to let it heal at night.
>>>>>>>>> But a VM crashed just now, and it can't boot up again: it seems to
>>>>>>>>> freeze trying to access its disks.
>>>>>>>>>
>>>>>>>>> Looking at the heal info for the volume, it has gone way up since
>>>>>>>>> this morning; it looks like the VMs aren't writing to both nodes,
>>>>>>>>> just the one they are on.
>>>>>>>>> It seems pretty bad. We have 2 of 3 nodes up; I would expect the
>>>>>>>>> volume to work just fine since it has quorum. What am I missing?
>>>>>>>>>
>>>>>>>>> It is still too early to start the heal. Is there a way to start
>>>>>>>>> the VM anyway right now? I mean, it was running a moment ago, so
>>>>>>>>> the data is there; it just needs to let the VM access it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Volume Name: vm-storage
>>>>>>>>> Type: Replicate
>>>>>>>>> Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
>>>>>>>>> Status: Started
>>>>>>>>> Number of Bricks: 1 x 3 = 3
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1: first_node:/mnt/vg1-storage
>>>>>>>>> Brick2: second_node:/mnt/vg1-storage
>>>>>>>>> Brick3: third_node:/mnt/vg1-storage
>>>>>>>>> Options Reconfigured:
>>>>>>>>> cluster.quorum-type: auto
>>>>>>>>> cluster.server-quorum-type: server
>>>>>>>>> network.remote-dio: enable
>>>>>>>>> cluster.eager-lock: enable
>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>> performance.quick-read: off
>>>>>>>>> performance.read-ahead: off
>>>>>>>>> performance.io-cache: off
>>>>>>>>> performance.stat-prefetch: off
>>>>>>>>> features.shard: on
>>>>>>>>> features.shard-block-size: 256MB
>>>>>>>>> cluster.server-quorum-ratio: 51%
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for your help
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kevin Lemonnier
>>>>>>>>> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Gluster-users mailing list
>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>
>>>>>>> --
>>>>>>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>
>


-- 
Lindsay Mathieson


