[Gluster-users] Heal operation detail of EC volumes

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Jun 8 07:24:29 UTC 2017


This mail did not show up in the same thread as the earlier one because the
subject had an extra "?==?utf-8?q? ", so I thought it had not been answered and
answered it again. Sorry about that.

On Sat, Jun 3, 2017 at 1:45 AM, Xavier Hernandez <xhernandez at datalab.es>
wrote:

> Hi Serkan,
>
> On Thursday, June 01, 2017 21:31 CEST, Serkan Çoban <cobanserkan at gmail.com>
> wrote:
>
>
> >Is it possible that this matches your observations?
> Yes, that matches what I see. So 19 files are being healed in parallel by 19
> SHD processes. I thought only one file was being healed at a time.
> Then what is the meaning of the disperse.shd-max-threads parameter? If I
> set it to 2, will each SHD heal two files at the same time?
>
> Each SHD normally heals a single file at a time. However, there's an SHD on
> each node, so all of them are trying to process dirty files. If one picks
> a file to heal, the other SHDs will skip that one and try another.
>
> The disperse.shd-max-threads option indicates how many heals each SHD can do
> simultaneously. Setting a value of 2 would mean that each SHD could heal 2
> files at once, for up to 40 simultaneous heals across 20 nodes.
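>
> For example, on a hypothetical volume named 'myvol' (substitute your own
> volume name), that would be:
>
>     gluster volume set myvol disperse.shd-max-threads 2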
>
>
> >How many IOPS can your bricks handle?
> Bricks are 7200RPM NL-SAS disks, 70-80 random IOPS max. But the write
> pattern seems sequential: 30-40MB bulk writes every 4-5 seconds.
> This is what iostat shows.
>
> This is probably caused by some write-back policy in the file system that
> accumulates multiple writes, optimizing disk access. This way the apparent
> 1000 IOPS can be handled by a single disk with 70-80 real IOPS, because each
> IO operation becomes bigger.
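>
> As a rough check with your numbers (assuming each logical write is an ~8 KB
> fragment, i.e. the 128 KB default heal block divided by 16):
>
>     8 MB/s / 8 KB per write ~= 1000 logical writes per second
>     30-40 MB every 4-5 s    ~= the same ~8 MB/s, merged into far fewer and
>                                larger, mostly sequential disk writes
>
> which a 70-80 IOPS disk can absorb.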
>
>
> >Do you have a test environment where we could check all this?
> Not currently, but I will have one in 4-5 weeks. New servers are arriving;
> I will add this test to my notes.
>
> > There's a feature that allows configuring the self-heal block size to
> > optimize these cases. The feature is available in 3.11.
> I did not see this in the 3.11 release notes; what parameter name should I
> look for?
>
> The new option is named 'self-heal-window-size'. It represents the size of
> each heal operation as the number of 128KB blocks to use. The default value
> is 1. To use blocks of 1MB, this parameter should be set to 8.
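>
> On the CLI that would look something like this (the exact option name may
> need the 'disperse.' prefix depending on the version, so treat this as a
> sketch; 'myvol' is a placeholder volume name):
>
>     gluster volume set myvol disperse.self-heal-window-size 8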
>
> Xavi
>
>
> On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
> > Hi Serkan,
> >
> > On 30/05/17 10:22, Serkan Çoban wrote:
> >>
> >> OK, I understand that the heal operation takes place on the server side. In
> >> this case I should see X KB of outbound network traffic from 16 servers and
> >> 16X KB of inbound traffic to the failed brick's server, right? So that
> >> process will get 16 chunks, recalculate the missing chunk and write it to
> >> disk.
> >
> >
> > That should be the normal operation for a single heal.
> >
> >> The problem is that I am not seeing that kind of traffic on the servers. In my
> >> configuration (16+4 EC) all 20 servers have 7-8MB/s outbound traffic and
> >> none of them has more than 10MB/s incoming traffic.
> >> Only the heal operation is happening on the cluster right now, no client or
> >> other traffic. I see a constant 7-8MB/s write to the healing brick's disk.
> >> So where is the missing traffic?
> >
> >
> > I'm not sure about your configuration, but you are probably seeing the
> > result of having the SHD of each server doing heals. That would explain the
> > network traffic you have.
> >
> > Suppose that all SHDs but the one on the damaged brick are working. In this
> > case 19 servers will pick 16 fragments each. This gives 19 * 16 = 304
> > fragments to be requested. EC balances the reads among all available
> > servers, and there's a chance (1/19) that a fragment is local to the server
> > requesting it. So we'll need a total of 304 - 304 / 19 = 288 network
> > requests, 288 / 19 = 15.2 sent by each server.
> >
> > If we have a total of 288 requests, it means that each server will answer
> > 288 / 19 = 15.2 requests. The net effect of all this is that each healthy
> > server is sending 15.2*X bytes of data and each healthy server is receiving
> > 15.2*X bytes of data.
> >
> > Now we need to account for the writes to the damaged brick. We have 19
> > simultaneous heals. This means that the damaged brick will receive 19*X
> > bytes of data, and each healthy server will send X additional bytes of
> > data.
> >
> > So:
> >
> > A healthy server receives 15.2*X bytes of data
> > A healthy server sends 16.2*X bytes of data
> > A damaged server receives 19*X bytes of data
> > A damaged server sends only a few bytes of data (basically communication and
> > synchronization overhead)
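> >
> > A quick way to sanity-check these figures (a minimal sketch; the 19 healthy
> > SHDs and 16 data fragments come from your 20-node 16+4 setup, X being the
> > fragment size):
> >
> >     awk 'BEGIN {
> >         shds = 19; frags = 16
> >         reads  = shds * frags          # 304 fragment reads in total
> >         remote = reads - reads / shds  # 288 of them go over the network
> >         per    = remote / shds         # ~15.2 answered (and issued) per healthy server
> >         printf "recv=%.1f*X send=%.1f*X damaged_recv=%d*X\n", per, per + 1, shds
> >     }'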
> >
> > As you can see, in this configuration each server has almost the same amount
> > of inbound and outbound traffic. The only big difference is the damaged
> > brick, which should receive a little more traffic but should send much
> > less.
> >
> > Is it possible that this matches your observations ?
> >
> > There's one more thing to consider here, and it's the apparent low
> > throughput of self-heal. One possible thing to check is the small size and
> > random behavior of the requests.
> >
> > Assuming that each request has a size of ~128 / 16 = 8KB, at a rate of ~8
> > MB/s the servers are processing ~1000 IOPS. Since requests are going to 19
> > different files, even if each file is accessed sequentially, the real effect
> > will be like random access (some read-ahead on the filesystem can improve
> > reads a bit, but writes won't benefit so much).
> >
> > How many IOPS can your bricks handle?
> >
> > Do you have a test environment where we could check all this? If possible,
> > it would be interesting to have only a single SHD (kill all SHDs on all
> > servers but one). In this situation, without client accesses, we should see
> > the 16/1 ratio of reads vs writes on the network. We should also see a
> > similar or even a little better speed, because all reads and writes will be
> > sequential, optimizing the available IOPS.
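> >
> > A possible way to do that (a sketch; it assumes the self-heal daemon shows
> > up as a 'glustershd' process, which it normally does, and a volume named
> > 'myvol'). On every node except the one you want to keep healing:
> >
> >     gluster volume status myvol    # the 'Self-heal Daemon' lines show the SHD PIDs
> >     kill $(pgrep -f glustershd)    # stop the local SHD; note glusterd may
> >                                    # respawn it on a volume start/restart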
> >
> > There's a feature that allows configuring the self-heal block size to
> > optimize these cases. The feature is available in 3.11.
> >
> > Best regards,
> >
> > Xavi
> >
> >
> >>
> >> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <aspandey at redhat.com>
> >> wrote:
> >>>
> >>>
> >>> When we say client side heal or server side heal, we are basically talking
> >>> about the side which "triggers" the heal of a file.
> >>>
> >>> 1 - server side heal - the shd scans the indices and triggers heal
> >>>
> >>> 2 - client side heal - a fop finds that a file needs heal and triggers heal
> >>> for that file.
> >>>
> >>> Now, what happens when heal gets triggered?
> >>> In both cases the following functions take part:
> >>>
> >>> ec_heal => ec_heal_throttle => ec_launch_heal
> >>>
> >>> ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap, which
> >>> calls ec_heal_do) and puts them into a queue.
> >>> This happens on the server, and the "syncenv" infrastructure, which is
> >>> nothing but a set of workers, picks these tasks and executes them. That is
> >>> when the actual reads/writes for heal happen.
> >>>
> >>>
> >>> ________________________________
> >>> From: "Serkan Çoban" <cobanserkan at gmail.com>
> >>> To: "Ashish Pandey" <aspandey at redhat.com>
> >>> Cc: "Gluster Users" <gluster-users at gluster.org>
> >>> Sent: Monday, May 29, 2017 6:44:50 PM
> >>> Subject: Re: [Gluster-users] Heal operation detail of EC volumes
> >>>
> >>>
> >>>>> Healing could be triggered by the client side (access of a file) or the
> >>>>> server side (shd).
> >>>>> However, in both cases the actual heal starts from the "ec_heal_do"
> >>>>> function.
> >>>
> >>> If I do a recursive getfattr operation from the clients, then all heal
> >>> operations are done on the clients, right? The client reads the chunks,
> >>> recalculates and writes the missing chunk.
> >>> And if I don't access the files from a client, then the SHD daemons will
> >>> start the heal and read, calculate and write the missing chunks, right?
> >>>
> >>> In the first case the EC calculations take place in the client fuse
> >>> process, and in the second case the EC calculations are made in the SHD
> >>> process, right?
> >>> Does the brick process have any role in the EC calculations?
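> >>>
> >>> (For reference, by a recursive getfattr I mean a crawl from the mount
> >>> point, something like the sketch below, where the mount path is just an
> >>> example:
> >>>
> >>>     find /mnt/myvol -exec getfattr -d -m . {} + > /dev/null
> >>>
> >>> so every file gets looked up by the client.)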
> >>>
> >>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <aspandey at redhat.com>
> >>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>> From: "Serkan Çoban" <cobanserkan at gmail.com>
> >>>> To: "Gluster Users" <gluster-users at gluster.org>
> >>>> Sent: Monday, May 29, 2017 5:13:06 PM
> >>>> Subject: [Gluster-users] Heal operation detail of EC volumes
> >>>>
> >>>> Hi,
> >>>>
> >>>> When a brick fails in EC, what is the healing read/write data path?
> >>>> Which processes do the operations?
> >>>>
> >>>> Healing could be triggered by the client side (access of a file) or the
> >>>> server side (shd).
> >>>> However, in both cases the actual heal starts from the "ec_heal_do"
> >>>> function.
> >>>>
> >>>>
> >>>> Assume a 2GB file is being healed in a 16+4 EC configuration. I was
> >>>> thinking that the SHD daemon on the failed brick's host will read 2GB from
> >>>> the network, reconstruct its ~128MB chunk and write it onto the brick. Is
> >>>> this right?
> >>>>
> >>>> You are correct about the read/write.
> >>>> The only point is that the SHD daemon on one of the good bricks will pick
> >>>> the index entry and heal it.
> >>>> The SHD daemon scans the .glusterfs/indices directory and heals the
> >>>> entries. If the brick went down while IO was going on, the index will be
> >>>> present on the killed brick also.
> >>>> However, if a brick was down and you then started writing to a file, the
> >>>> index entry would not be present on the killed brick.
> >>>> So even after the brick comes back UP, the SHD on that brick will not be
> >>>> able to find this index. However, the other bricks will have the entries
> >>>> and the SHD on one of those bricks will heal it.
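> >>>>
> >>>> If you want to see which entries are currently pending heal, i.e. what
> >>>> those indices translate to, something like this should list them per
> >>>> brick (assuming a volume named 'myvol'):
> >>>>
> >>>>     gluster volume heal myvol info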
> >>>>
> >>>> Note: I am considering each brick to be on a different node.
> >>>>
> >>>> Ashish
> >
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>



-- 
Pranith