[Gluster-users] Heal operation detail of EC volumes

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Jun 8 07:21:25 UTC 2017


On Thu, Jun 8, 2017 at 12:49 PM, Pranith Kumar Karampuri <
pkarampu at redhat.com> wrote:

>
>
> On Fri, Jun 2, 2017 at 1:01 AM, Serkan Çoban <cobanserkan at gmail.com>
> wrote:
>
>> >Is it possible that this matches your observations ?
>> Yes, that matches what I see. So 19 files are being healed in parallel by
>> 19 SHD processes. I thought only one file was being healed at a time.
>> Then what is the meaning of the disperse.shd-max-threads parameter? If I
>> set it to 2, will each SHD thread heal two files at the same time?
>>
>
> Yes that is the idea.
>

One small correction: if the volume is n*(16+4) and the server has at least
one brick in each of these n subvolumes, then the server will run 'n' heals
in parallel, and if you set max-threads to 2 it will run 2n.
So the option is per EC subvolume.
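
To make the rule above concrete, here is a minimal sketch of the arithmetic. The function and parameter names are hypothetical, and the default of 1 thread is an assumption taken from the "n heals" case described above, not from the Gluster source.

```python
def parallel_heals(subvolumes_with_local_brick: int, shd_max_threads: int = 1) -> int:
    """Estimate how many heals one server's SHD runs at the same time.

    Assumption: the thread limit applies per EC subvolume in which this
    server holds at least one brick, as stated in the message above.
    """
    if subvolumes_with_local_brick < 0:
        raise ValueError("subvolume count cannot be negative")
    if shd_max_threads < 1:
        raise ValueError("shd_max_threads must be at least 1")
    return subvolumes_with_local_brick * shd_max_threads


# A server with bricks in 3 EC subvolumes: n heals, then 2n with max-threads = 2.
print(parallel_heals(3))
print(parallel_heals(3, shd_max_threads=2))
```

Under this reading, raising disperse.shd-max-threads scales the heal concurrency of every subvolume on the server at once.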


>
>
>>
>> >How many IOPS can handle your bricks ?
>> Bricks are 7200RPM NL-SAS disks. 70-80 random IOPS max. But write
>> pattern seems sequential, 30-40MB bulk writes every 4-5 seconds.
>> This is what iostat shows.
>>
>> >Do you have a test environment where we could check all this ?
>> Not currently but will have in 4-5 weeks. New servers are arriving, I
>> will add this test to my notes.
>>
>> > There's a feature to allow to configure the self-heal block size to
>> optimize these cases. The feature is available on 3.11.
>> I did not see this in the 3.11 release notes. What parameter name should
>> I look for?
>>
>
> disperse.self-heal-window-size
>
>
> +    { .key  = {"self-heal-window-size"},
> +        .type = GF_OPTION_TYPE_INT,
> +        .min  = 1,
> +        .max  = 1024,
> +        .default_value = "1",
> +        .description = "Maximum number blocks(128KB) per file for which "
> +                       "self-heal process would be applied simultaneously."
> +    },
>
> This is the patch: https://review.gluster.org/17098
>
> +Sunil,
>      Could you add this to release notes please?
>
>>
>>
>>
>> On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <xhernandez at datalab.es>
>> wrote:
>> > Hi Serkan,
>> >
>> > On 30/05/17 10:22, Serkan Çoban wrote:
>> >>
>> >> OK, I understand that the heal operation takes place on the server
>> >> side. In this case I should see X KB of outbound network traffic from
>> >> 16 servers and 16X KB of inbound traffic to the failed brick's server,
>> >> right? So that process will get 16 chunks, recalculate our chunk and
>> >> write it to disk.
>> >
>> >
>> > That should be the normal operation for a single heal.
>> >
>> >> The problem is that I am not seeing this kind of traffic on the
>> >> servers. In my configuration (16+4 EC) I see that all 20 servers have
>> >> 7-8MB of outbound traffic and none of them has more than 10MB of
>> >> incoming traffic. Only the heal operation is happening on the cluster
>> >> right now, no client/other traffic. I see a constant 7-8MB write to the
>> >> healing brick's disk. So where is the missing traffic?
>> >
>> >
>> > Not sure about your configuration, but probably you are seeing the
>> > result of having the SHD of each server doing heals. That would explain
>> > the network traffic you have.
>> >
>> > Suppose that all SHDs but the one on the damaged brick are working. In
>> > this case 19 servers will read 16 fragments each. This gives 19 * 16 =
>> > 304 fragments to be requested. EC balances the reads among all available
>> > servers, and there's a chance (1/19) that a fragment is local to the
>> > server asking for it. So we'll need a total of 304 - 304 / 19 = 288
>> > network requests, 288 / 19 = 15.2 sent by each server.
>> >
>> > If we have a total of 288 requests, it means that each server will
>> > answer 288 / 19 = 15.2 requests. The net effect of all this is that each
>> > healthy server is sending 15.2*X bytes of data and each server is
>> > receiving 15.2*X bytes of data.
>> >
>> > Now we need to account for the writes to the damaged brick. We have 19
>> > simultaneous heals. This means that the damaged brick will receive 19*X
>> > bytes of data, and each healthy server will send X additional bytes of
>> > data.
>> >
>> > So:
>> >
>> > A healthy server receives 15.2*X bytes of data
>> > A healthy server sends 16.2*X bytes of data
>> > A damaged server receives 19*X bytes of data
>> > A damaged server sends few bytes of data (communication and
>> > synchronization overhead basically)
>> >
>> > As you can see, in this configuration each server has almost the same
>> > amount of inbound and outbound traffic. The only big difference is the
>> > damaged brick, which should receive a little more traffic but send much
>> > less.
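
The traffic estimate above can be reproduced with a short calculation. This is only a sketch of the arithmetic in the message, in units of X (the bytes of one fragment per heal); the function name and the dictionary keys are hypothetical.

```python
from fractions import Fraction


def heal_traffic(healthy_servers: int = 19, data_fragments: int = 16) -> dict:
    """Per-server traffic in multiples of X when every healthy SHD heals one file.

    Assumption (from the message above): reads are balanced evenly, so a
    fragment is local to the requesting server with chance 1/healthy_servers.
    """
    total_reads = healthy_servers * data_fragments            # 19 * 16 = 304
    local_reads = Fraction(total_reads, healthy_servers)      # 304 / 19 = 16
    network_reads = total_reads - local_reads                 # 288
    per_server = network_reads / healthy_servers              # 288 / 19
    return {
        "healthy_receives": per_server,
        # Each healthy server also writes one rebuilt fragment to the damaged brick.
        "healthy_sends": per_server + 1,
        # The damaged brick receives one fragment per simultaneous heal.
        "damaged_receives": Fraction(healthy_servers),
    }


traffic = heal_traffic()
for name, value in traffic.items():
    print(f"{name}: {float(value):.1f} * X")
```

Exact fractions are used so that the rounding to 15.2 and 16.2 happens only when printing.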
>> >
>> > Is it possible that this matches your observations ?
>> >
>> > There's one more thing to consider here, and it's the apparent low
>> > throughput of self-heal. One possible thing to check is the small size
>> > and random behavior of the requests.
>> >
>> > Assuming that each request has a size of ~128 / 16 = 8KB, at a rate of
>> > ~8 MB/s the servers are processing ~1000 IOPS. Since requests are going
>> > to 19 different files, even if each file is accessed sequentially, the
>> > real effect will be like random access (some read-ahead on the
>> > filesystem can improve reads a bit, but writes won't benefit so much).
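
The IOPS figure above follows from the request size. The sketch below repeats that estimate under the same assumptions (a 128KB heal block split over 16 data fragments); the function name and its parameters are hypothetical.

```python
def estimated_iops(throughput_mb_per_s: float,
                   block_kb: float = 128,
                   data_fragments: int = 16) -> float:
    """Requests per second implied by a given per-server throughput.

    Assumption: every request carries block_kb / data_fragments KB,
    i.e. 128 / 16 = 8KB in the 16+4 configuration discussed above.
    """
    request_kb = block_kb / data_fragments
    return throughput_mb_per_s * 1024 / request_kb


# ~8 MB/s of 8KB requests is on the order of 1000 requests per second.
print(estimated_iops(8.0))
```

Compared with the 70-80 random IOPS quoted for a 7200RPM disk, this shows why the access pattern matters more than the raw bandwidth in this case.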
>> >
>> > How many IOPS can handle your bricks ?
>> >
>> > Do you have a test environment where we could check all this ? If
>> > possible it would be interesting to have only a single SHD (kill all
>> > SHDs on all servers but one). In this situation, without client
>> > accesses, we should see the 16/1 ratio of reads vs writes on the
>> > network. We should also see a similar or even a little better speed
>> > because all reads and writes will be sequential, optimizing available
>> > IOPS.
>> >
>> > There's a feature to allow to configure the self-heal block size to
>> > optimize these cases. The feature is available on 3.11.
>> >
>> > Best regards,
>> >
>> > Xavi
>> >
>> >
>> >>
>> >> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <aspandey at redhat.com>
>> >> wrote:
>> >>>
>> >>>
>> >>> When we say client side heal or server side heal, we are basically
>> >>> talking about the side which "triggers" the heal of a file.
>> >>>
>> >>> 1 - server side heal - shd scans indices and triggers heal
>> >>>
>> >>> 2 - client side heal - a fop finds that a file needs heal and it
>> >>> triggers heal for that file.
>> >>>
>> >>> Now, what happens when heal gets triggered?
>> >>> In both cases the following functions take part -
>> >>>
>> >>> ec_heal => ec_heal_throttle => ec_launch_heal
>> >>>
>> >>> Now ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap,
>> >>> which calls ec_heal_do) and puts them into a queue.
>> >>> This happens on the server, and the "syncenv" infrastructure, which is
>> >>> nothing but a set of workers, picks up these tasks and executes them.
>> >>> That is when the actual read/write for heal happens.
>> >>>
>> >>>
>> >>> ________________________________
>> >>> From: "Serkan Çoban" <cobanserkan at gmail.com>
>> >>> To: "Ashish Pandey" <aspandey at redhat.com>
>> >>> Cc: "Gluster Users" <gluster-users at gluster.org>
>> >>> Sent: Monday, May 29, 2017 6:44:50 PM
>> >>> Subject: Re: [Gluster-users] Heal operation detail of EC volumes
>> >>>
>> >>>
>> >>>>> Healing could be triggered by client side (access of file) or server
>> >>>>> side
>> >>>>> (shd).
>> >>>>> However, in both the cases actual heal starts from "ec_heal_do"
>> >>>>> function.
>> >>>
>> >>> If I do a recursive getfattr operation from clients, then all heal
>> >>> operations are done on the clients, right? The client reads the chunks,
>> >>> calculates and writes the missing chunk.
>> >>> And if I don't access files from a client then the SHD daemons will
>> >>> start the heal and read, calculate and write the missing chunks, right?
>> >>>
>> >>> In the first case EC calculations take place in the client fuse
>> >>> process, in the second case EC calculations will be made in the SHD
>> >>> process, right? Does the brick process have any role in EC calculations?
>> >>>
>> >>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <aspandey at redhat.com>
>> >>> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> ________________________________
>> >>>> From: "Serkan Çoban" <cobanserkan at gmail.com>
>> >>>> To: "Gluster Users" <gluster-users at gluster.org>
>> >>>> Sent: Monday, May 29, 2017 5:13:06 PM
>> >>>> Subject: [Gluster-users] Heal operation detail of EC volumes
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> When a brick fails in EC, What is the healing read/write data path?
>> >>>> Which processes do the operations?
>> >>>>
>> >>>> Healing could be triggered by client side (access of file) or server
>> >>>> side
>> >>>> (shd).
>> >>>> However, in both the cases actual heal starts from "ec_heal_do"
>> >>>> function.
>> >>>>
>> >>>>
>> >>>> Assume a 2GB file is being healed in 16+4 EC configuration. I was
>> >>>> thinking that the SHD daemon on the failed brick host will read 2GB
>> >>>> from the network, reconstruct its 100MB chunk and write it on to the
>> >>>> brick. Is this right?
>> >>>>
>> >>>> You are correct about read/write.
>> >>>> The only point is that the SHD daemon on one of the good bricks will
>> >>>> pick the index entry and heal it.
>> >>>> The SHD daemon scans the .glusterfs/index directory and heals the
>> >>>> entries. If the brick went down while IO was going on, the index will
>> >>>> be present on the killed brick also.
>> >>>> However, if a brick was down and then you started writing to a file,
>> >>>> the index entry would not be present on the killed brick.
>> >>>> So even after the brick is UP again, the shd on that brick will not be
>> >>>> able to find this index entry. However, other bricks would have the
>> >>>> entries and the shd on those bricks will heal it.
>> >>>>
>> >>>> Note: I am considering each brick on different node.
>> >>>>
>> >>>> Ashish
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> Gluster-users mailing list
>> >>>> Gluster-users at gluster.org
>> >>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
>
> --
> Pranith
>



-- 
Pranith