[Gluster-users] Heal operation detail of EC volumes

Fri Jun 2 20:15:40 UTC 2017

Hi Serkan,

On Thursday, June 01, 2017 21:31 CEST, Serkan Çoban <cobanserkan at gmail.com> wrote:
>> Is it possible that this matches your observations ?
> Yes that matches what I see. So 19 files are being healed in parallel by 19
> SHD processes. I thought only one file is being healed at a time.
> Then what is the meaning of disperse.shd-max-threads parameter? If I
> set it to 2 then each SHD thread will heal two files at the same time?

Each SHD normally heals a single file at a time. However, there's an SHD on each node, so all of them are trying to process dirty files. If one picks a file to heal, the other SHDs will skip that one and try another.

The disperse.shd-max-threads option indicates how many heals each SHD can run simultaneously. Setting a value of 2 would mean that each SHD could heal 2 files at once, up to 40 in total using 20 nodes.
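
As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name is made up for this example and is not a Gluster API; it assumes one SHD per node and simply multiplies the number of SHDs by the disperse.shd-max-threads value, so the result is an upper bound, not a measurement.

```python
def max_parallel_heals(shd_count: int, shd_max_threads: int) -> int:
    """Upper bound on simultaneous heals across the cluster.

    Hypothetical helper: assumes one SHD per node, each running at most
    shd_max_threads heals at the same time.
    """
    if shd_count < 0 or shd_max_threads < 1:
        raise ValueError("shd_count must be >= 0 and shd_max_threads >= 1")
    return shd_count * shd_max_threads


print(max_parallel_heals(19, 1))  # 19 working SHDs, one heal each
print(max_parallel_heals(20, 2))  # the "up to 40 using 20 nodes" case
```

Whether the cluster actually reaches that number depends on how many dirty files are pending and on what the bricks can sustain.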
>> How many IOPS can handle your bricks ?
> Bricks are 7200RPM NL-SAS disks. 70-80 random IOPS max. But write
> pattern seems sequential, 30-40MB bulk writes every 4-5 seconds.
> This is what iostat shows.

This is probably caused by some write-back policy on the file system that accumulates multiple writes, optimizing disk access. This way the apparent 1000 IOPS can be handled by a single disk with 70-80 real IOPS by making each IO operation bigger.
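
The numbers can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic under the figures mentioned in this thread (128KB heal blocks on a 16+4 volume, ~8 MB/s heal rate, 30-40MB flushes every 4-5 seconds); the helper name is made up, and a real flush may still be split into several device requests.

```python
KB = 1024
MB = 1024 * KB


def iops(throughput_bytes_per_s: float, request_bytes: float) -> float:
    """Requests per second needed to sustain a throughput at a given request size."""
    return throughput_bytes_per_s / request_bytes


# One 128KB heal block split over 16 data fragments: 8KB per brick.
fragment = 128 * KB / 16

# Logical request rate at ~8 MB/s with 8KB requests (the "~1000 IOPS").
logical_iops = iops(8 * MB, fragment)

# Average throughput of the bulk writes seen in iostat.
burst_low = 30 * MB / 5   # 30MB every 5 seconds
burst_high = 40 * MB / 4  # 40MB every 4 seconds

print(f"fragment size: {fragment / KB:.0f} KB")
print(f"logical IOPS:  {logical_iops:.0f}")
print(f"bulk writes:   {burst_low / MB:.0f}-{burst_high / MB:.0f} MB/s on average")
```

The average of the bulk writes lands in the same range as the heal rate, which is consistent with many small writes being accumulated and flushed together.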
>> Do you have a test environment where we could check all this ?
> Not currently but will have in 4-5 weeks. New servers are arriving, I
> will add this test to my notes.
>
>> There's a feature to allow to configure the self-heal block size to optimize these cases. The feature is available on 3.11.
> I did not see this in 3.11 release notes, what parameter name I should look for?

The new option is named 'self-heal-window-size'. It represents the size of each heal operation as the number of 128KB blocks to use. The default value is 1. To use blocks of 1MB, this parameter should be set to 8.
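
A small Python sketch of that conversion, assuming the 128KB block size stated above. The helper names are made up for illustration. The full option name to pass to "gluster volume set" is probably disperse.self-heal-window-size, by analogy with disperse.shd-max-threads, but that prefix is an assumption here; "gluster volume set help" on a 3.11 install should show the exact name.

```python
BLOCK_KB = 128  # size of one heal block, as described above


def heal_size_kb(window_size: int) -> int:
    """Size in KB of each heal operation for a given window size."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    return window_size * BLOCK_KB


def window_for(target_kb: int) -> int:
    """Smallest window size whose heal operation covers target_kb (rounds up)."""
    if target_kb < 1:
        raise ValueError("target_kb must be >= 1")
    return -(-target_kb // BLOCK_KB)


print(heal_size_kb(1))        # default window: 128 KB per heal operation
print(window_for(1024))       # window needed for 1MB heal operations: 8
print(heal_size_kb(8) // 16)  # per-brick fragment on a 16+4 volume, in KB: 64
```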

Xavi
On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
> Hi Serkan,
>
> On 30/05/17 10:22, Serkan Çoban wrote:
>>
>> Ok, I understand that the heal operation takes place on the server side. In
>> this case I should see X KB of outbound network traffic from 16 servers and
>> 16X KB of inbound traffic to the failed brick server, right? So that process
>> will get 16 chunks, recalculate its own chunk and write it to disk.
>
>
> That should be the normal operation for a single heal.
>
>> The problem is I am not seeing that kind of traffic on the servers. In my
>> configuration (16+4 EC) I see that all 20 servers have 7-8MB of outbound
>> traffic and none of them has more than 10MB of incoming traffic.
>> Only the heal operation is running on the cluster right now, no client/other
>> traffic. I see a constant 7-8MB write to the healing brick disk. So where is
>> the missing traffic?
>
>
> Not sure about your configuration, but probably you are seeing the result of
> having the SHD of each server doing heals. That would explain the network
> traffic you have.
>
> Suppose that all SHDs but the one on the damaged brick are working. In this
> case 19 servers will pick 16 fragments each. This gives 19 * 16 = 304
> fragments to be requested. EC balances the reads among all available
> servers, and there's a chance (1/19) that a fragment is local to the server
> asking it. So we'll need a total of 304 - 304 / 19 = 288 network requests,
> 288 / 19 = 15.2 sent by each server.
>
> If we have a total of 288 requests, it means that each server will answer
> 288 / 19 = 15.2 requests. The net effect of all this is that each healthy
> server is sending 15.2*X bytes of data and each server is receiving 15.2*X
> bytes of data.
>
> Now we need to account for the writes to the damaged brick. We have 19
> simultaneous heals. This means that the damaged brick will receive 19*X
> bytes of data, and each healthy server will send X additional bytes of data.
>
> So:
>
> A healthy server receives 15.2*X bytes of data
> A healthy server sends 16.2*X bytes of data
> A damaged server receives 19*X bytes of data
> A damaged server sends few bytes of data (communication and synchronization
> overhead basically)
>
> As you can see, in this configuration each server has almost the same amount
> of inbound and outbound traffic. The only big difference is the damaged brick,
> which should receive a little more traffic, but should send much less.
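
The estimate above can be reproduced with a short Python sketch. It is only a model of the quoted reasoning (16+4 volume, one damaged brick, every healthy server healing one file at a time, reads balanced evenly); the function name and dictionary keys are made up for this example. All values are in units of X, the amount of data one fragment contributes to a heal.

```python
from fractions import Fraction


def heal_traffic(data_fragments: int = 16, total_servers: int = 20) -> dict:
    """Per-server heal traffic, in units of X, with one damaged brick."""
    healthy = total_servers - 1                    # 19 SHDs doing heals
    total_reads = healthy * data_fragments         # 19 * 16 = 304 fragments
    local_reads = Fraction(total_reads, healthy)   # 1/19 of them are local: 16
    network_reads = total_reads - local_reads      # 304 - 16 = 288 requests
    per_server = network_reads / healthy           # 288 / 19, about 15.2
    return {
        "healthy_recv": per_server,
        # Each healthy server also writes one fragment to the damaged brick.
        "healthy_send": per_server + 1,
        # The damaged brick receives one fragment per simultaneous heal.
        "damaged_recv": healthy,
    }


for name, value in heal_traffic().items():
    print(f"{name}: {float(value):.1f}X")
```

Exact fractions are used so that the rounding to 15.2 and 16.2 only happens when printing.
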
>
> Is it possible that this matches your observations ?
>
> There's one more thing to consider here, and it's the apparent low
> throughput of self-heal. One possible thing to check is the small size and
> random behavior of the requests.
>
> Assuming that each request has a size of ~128 / 16 = 8KB, at a rate of ~8
> MB/s the servers are processing ~1000 IOPS. Since requests are going to 19
> different files, even if each file is accessed sequentially, the real effect
> will be like random access (some read-ahead on the filesystem can improve
> reads a bit, but writes won't benefit so much).
>
> How many IOPS can handle your bricks ?
>
> Do you have a test environment where we could check all this ? If possible
> it would be interesting to have only a single SHD (kill all SHDs on all
> servers but one). In this situation, without client accesses, we should see
> the 16/1 ratio of reads vs writes on the network. We should also see a
> similar or even a little better speed because all reads and writes will be
> sequential, optimizing available IOPS.
>
> There's a feature that allows configuring the self-heal block size to optimize
> these cases. The feature is available on 3.11.
>
> Best regards,
>
> Xavi
>
>
>>
>> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <aspandey at redhat.com>
>> wrote:
>>>
>>>
>>> When we say client side heal or server side heal, we are basically talking
>>> about the side which "triggers" the heal of a file.
>>>
>>> 1 - server side heal - shd scans indices and triggers heal
>>>
>>> 2 - client side heal - a fop finds that file needs heal and it triggers
>>> heal
>>> for that file.
>>>
>>> Now, what happens when a heal gets triggered?
>>> In both cases the following functions take part:
>>>
>>> ec_heal => ec_heal_throttle => ec_launch_heal
>>>
>>> ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap, which
>>> calls ec_heal_do) and puts them into a queue.
>>> This happens on the server, and the "syncenv" infrastructure, which is
>>> nothing but a set of workers, picks up these tasks and executes them. That
>>> is when the actual read/write for the heal happens.
>>>
>>>
>>> ________________________________
>>> From: "Serkan Çoban" <cobanserkan at gmail.com>
>>> To: "Ashish Pandey" <aspandey at redhat.com>
>>> Cc: "Gluster Users" <gluster-users at gluster.org>
>>> Sent: Monday, May 29, 2017 6:44:50 PM
>>> Subject: Re: [Gluster-users] Heal operation detail of EC volumes
>>>
>>>
>>>>> Healing could be triggered by client side (access of file) or server
>>>>> side
>>>>> (shd).
>>>>> However, in both the cases actual heal starts from "ec_heal_do"
>>>>> function.
>>>
>>> If I do a recursive getfattr operation from clients, then the whole heal
>>> operation is done on the clients, right? The client reads the chunks,
>>> calculates and writes the missing chunk.
>>> And if I don't access files from a client, then the SHD daemons will start
>>> the heal and read, calculate and write the missing chunks, right?
>>>
>>> In the first case the EC calculations take place in the client fuse process,
>>> in the second case the EC calculations will be made in the SHD process, right?
>>> Does the brick process have any role in EC calculations?
>>>
>>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <aspandey at redhat.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: "Serkan Çoban" <cobanserkan at gmail.com>
>>>> To: "Gluster Users" <gluster-users at gluster.org>
>>>> Sent: Monday, May 29, 2017 5:13:06 PM
>>>> Subject: [Gluster-users] Heal operation detail of EC volumes
>>>>
>>>> Hi,
>>>>
>>>> When a brick fails in EC, What is the healing read/write data path?
>>>> Which processes do the operations?
>>>>
>>>> Healing could be triggered by client side (access of file) or server
>>>> side
>>>> (shd).
>>>> However, in both the cases actual heal starts from "ec_heal_do"
>>>> function.
>>>>
>>>>
>>>> Assume a 2GB file is being healed in 16+4 EC configuration. I was
>>>> thinking that the SHD daemon on the failed brick host will read 2GB from
>>>> the network, reconstruct its 100MB chunk and write it to the brick. Is
>>>> this right?
>>>>
>>>> You are correct about read/write.
>>>> The only point is that the SHD daemon on one of the good bricks will pick
>>>> the index entry and heal it.
>>>> The SHD daemon scans the .glusterfs/index directory and heals the entries.
>>>> If the brick went down while IO was going on, the index will be present on
>>>> the killed brick also.
>>>> However, if a brick was down and then you started writing to a file, the
>>>> index entry would not be present on the killed brick.
>>>> So even after the brick is UP again, the shd on that brick will not be able
>>>> to find this index. However, the other bricks would have entries, and the
>>>> shd on those bricks will heal it.
>>>>
>>>> Note: I am considering each brick on different node.
>>>>
>>>> Ashish
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>