<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 2, 2017 at 1:01 AM, Serkan Çoban <span dir="ltr"><<a href="mailto:cobanserkan@gmail.com" target="_blank">cobanserkan@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">>Is it possible that this matches your observations ?<br>
</span>Yes, that matches what I see. So 19 files are being healed in parallel by 19<br>
SHD processes. I thought only one file was being healed at a time.<br>
Then what is the meaning of the disperse.shd-max-threads parameter? If I<br>
set it to 2, will each SHD heal two files at the same time?<br></blockquote><div><br></div><div>Yes, that is the idea.<br></div><div> </div>
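<div>For example, something like this should let each self-heal daemon work on up to two files in parallel (hypothetical volume name "testvol"; check "gluster volume set help" for the exact option description on your version):<br><br>gluster volume set testvol disperse.shd-max-threads 2<br><br>With the 19 active SHDs in your setup, that would allow up to 19 * 2 = 38 file heals running concurrently across the cluster.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">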
<span class="gmail-"><br>
>How many IOPS can your bricks handle?<br>
</span>Bricks are 7200RPM NL-SAS disks, 70-80 random IOPS max. But the write<br>
pattern seems sequential: 30-40MB bulk writes every 4-5 seconds.<br>
This is what iostat shows.<br>
<span class="gmail-"><br>
>Do you have a test environment where we could check all this?<br>
</span>Not currently, but I will have one in 4-5 weeks. New servers are arriving; I<br>
will add this test to my notes.<br>
<span class="gmail-"><br>
> There's a feature that allows configuring the self-heal block size to optimize these cases. The feature is available in 3.11.<br>
</span>I did not see this in the 3.11 release notes; what parameter name should I look for?<br></blockquote><div><br>disperse.self-heal-window-size<br><br><br>+ { .key = {"self-heal-window-size"},<br>+ .type = GF_OPTION_TYPE_INT,<br>+ .min = 1,<br>+ .max = 1024,<br>+ .default_value = "1",<br>+ .description = "Maximum number blocks(128KB) per file for which "<br>+ "self-heal process would be applied simultaneously."<br>+ },<br><br></div><div>This is the patch: <a href="https://review.gluster.org/17098">https://review.gluster.org/17098</a><br></div><div><br></div><div>+Sunil,<br></div><div> Could you add this to the release notes, please?<br> </div>
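<div><br></div><div>A quick usage sketch (hypothetical volume name "testvol"; numbers only illustrative): the window is counted in 128KB blocks, so a larger value makes each heal iteration move a bigger sequential chunk per file:<br><br>gluster volume set testvol disperse.self-heal-window-size 8<br><br>With the default of 1, every iteration heals only 128KB of a file, which on a 16+4 volume translates to ~8KB reads/writes per brick and the random-IO-like pattern Xavi describes below. A window of 8 would move 8 x 128KB = 1MB per iteration instead.<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">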
<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
<br>
<br>
On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <<a href="mailto:xhernandez@datalab.es">xhernandez@datalab.es</a>> wrote:<br>
> Hi Serkan,<br>
><br>
> On 30/05/17 10:22, Serkan Çoban wrote:<br>
>><br>
>> OK, I understand that the heal operation takes place on the server side. In<br>
>> this case I should see X KB of outbound network traffic from 16 servers and<br>
>> 16X KB of inbound traffic to the failed brick's server, right? So that<br>
>> process will get 16 chunks, recalculate its chunk and write it to disk.<br>
><br>
><br>
> That should be the normal operation for a single heal.<br>
><br>
>> The problem is I am not seeing that kind of traffic on the servers. In my<br>
>> configuration (16+4 EC) all 20 servers have 7-8MB outbound traffic and none<br>
>> of them has more than 10MB incoming traffic.<br>
>> Only the heal operation is happening on the cluster right now, no client or<br>
>> other traffic. I see a constant 7-8MB write to the healing brick's disk. So<br>
>> where is the missing traffic?<br>
><br>
><br>
> I'm not sure about your configuration, but you are probably seeing the result<br>
> of the SHD on each server doing heals. That would explain the network traffic<br>
> you see.<br>
><br>
> Suppose that all SHDs but the one on the damaged brick are working. In this<br>
> case 19 servers will pick 16 fragments each. This gives 19 * 16 = 304<br>
> fragments to be requested. EC balances the reads among all available<br>
> servers, and there's a chance (1/19) that a fragment is local to the server<br>
> asking for it. So we'll need a total of 304 - 304 / 19 = 288 network requests,<br>
> or 288 / 19 = 15.2 sent by each server.<br>
><br>
> Since we have a total of 288 requests, each healthy server will answer<br>
> 288 / 19 = 15.2 of them. The net effect of all this is that each healthy<br>
> server is sending 15.2*X bytes of data and receiving 15.2*X<br>
> bytes of data.<br>
><br>
> Now we need to account for the writes to the damaged brick. We have 19<br>
> simultaneous heals. This means that the damaged brick will receive 19*X<br>
> bytes of data, and each healthy server will send X additional bytes of data.<br>
><br>
> So:<br>
><br>
> A healthy server receives 15.2*X bytes of data<br>
> A healthy server sends 16.2*X bytes of data<br>
> A damaged server receives 19*X bytes of data<br>
> A damaged server sends only a few bytes of data (basically communication and<br>
> synchronization overhead)<br>
><br>
> As you can see, in this configuration each server has almost the same amount<br>
> of inbound and outbound traffic. The only big difference is the damaged brick,<br>
> which should receive a little more traffic but send much less.<br>
><br>
> Is it possible that this matches your observations ?<br>
><br>
> There's one more thing to consider here, and it's the apparent low<br>
> throughput of self-heal. One possible thing to check is the small size and<br>
> random behavior of the requests.<br>
><br>
> Assuming that each request has a size of ~128KB / 16 = 8KB, at a rate of ~8<br>
> MB/s the servers are processing ~1000 IOPS. Since the requests are going to 19<br>
> different files, even if each file is accessed sequentially, the real effect<br>
> will be like random access (some read-ahead on the filesystem can improve<br>
> reads a bit, but writes won't benefit so much).<br>
><br>
> How many IOPS can your bricks handle?<br>
><br>
> Do you have a test environment where we could check all this? If possible,<br>
> it would be interesting to have only a single SHD running (kill the SHDs on all<br>
> servers but one). In this situation, without client accesses, we should see<br>
> the 16/1 ratio of reads vs writes on the network. We should also see a<br>
> similar or even a little better speed, because all reads and writes will be<br>
> sequential, making better use of the available IOPS.<br>
><br>
> There's a feature that allows configuring the self-heal block size to optimize<br>
> these cases. The feature is available in 3.11.<br>
><br>
> Best regards,<br>
><br>
> Xavi<br>
><br>
><br>
>><br>
>> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <<a href="mailto:aspandey@redhat.com">aspandey@redhat.com</a>><br>
>> wrote:<br>
>>><br>
>>><br>
>>> When we say client side heal or server side heal, we are basically talking<br>
>>> about the side which "triggers" the heal of a file.<br>
>>><br>
>>> 1 - server side heal - the shd scans the indices and triggers the heal<br>
>>><br>
>>> 2 - client side heal - a fop finds that a file needs heal and triggers the<br>
>>> heal for that file.<br>
>>><br>
>>> Now, what happens when a heal gets triggered?<br>
>>> In both cases the following functions take part:<br>
>>><br>
>>> ec_heal => ec_heal_throttle => ec_launch_heal<br>
>>><br>
>>> ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap, which<br>
>>> calls ec_heal_do) and puts them into a queue.<br>
>>> This happens on the server, and the "syncenv" infrastructure, which is<br>
>>> nothing but a set of workers, picks these tasks and executes them. That is<br>
>>> when the actual reads/writes for heal happen.<br>
>>><br>
>>><br>
>>> ______________________________<wbr>__<br>
>>> From: "Serkan Çoban" <<a href="mailto:cobanserkan@gmail.com">cobanserkan@gmail.com</a>><br>
>>> To: "Ashish Pandey" <<a href="mailto:aspandey@redhat.com">aspandey@redhat.com</a>><br>
>>> Cc: "Gluster Users" <<a href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>><br>
>>> Sent: Monday, May 29, 2017 6:44:50 PM<br>
>>> Subject: Re: [Gluster-users] Heal operation detail of EC volumes<br>
>>><br>
>>><br>
>>>>> Healing could be triggered by the client side (access of a file) or the<br>
>>>>> server side (shd).<br>
>>>>> However, in both cases the actual heal starts from the "ec_heal_do"<br>
>>>>> function.<br>
>>><br>
>>> If I do a recursive getfattr operation from the clients, then all the heal<br>
>>> operations are done on the clients, right? The client reads the chunks,<br>
>>> recalculates and writes the missing chunk.<br>
>>> And if I don't access the files from a client, then the SHD daemons will<br>
>>> start the heal and read, calculate and write the missing chunks, right?<br>
>>><br>
>>> In the first case the EC calculations take place in the client fuse process,<br>
>>> in the second case the EC calculations are made in the SHD process, right?<br>
>>> Does the brick process have any role in the EC calculations?<br>
>>><br>
>>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <<a href="mailto:aspandey@redhat.com">aspandey@redhat.com</a>><br>
>>> wrote:<br>
>>>><br>
>>>><br>
>>>><br>
>>>> ______________________________<wbr>__<br>
>>>> From: "Serkan Çoban" <<a href="mailto:cobanserkan@gmail.com">cobanserkan@gmail.com</a>><br>
>>>> To: "Gluster Users" <<a href="mailto:gluster-users@gluster.org">gluster-users@gluster.org</a>><br>
>>>> Sent: Monday, May 29, 2017 5:13:06 PM<br>
>>>> Subject: [Gluster-users] Heal operation detail of EC volumes<br>
>>>><br>
>>>> Hi,<br>
>>>><br>
>>>> When a brick fails in EC, what is the healing read/write data path?<br>
>>>> Which processes do the operations?<br>
>>>><br>
>>>> Healing could be triggered by the client side (access of a file) or the<br>
>>>> server side (shd).<br>
>>>> However, in both cases the actual heal starts from the "ec_heal_do"<br>
>>>> function.<br>
>>>><br>
>>>><br>
>>>> Assume a 2GB file is being healed in a 16+4 EC configuration. I was<br>
>>>> thinking that the SHD daemon on the failed brick's host would read 2GB from<br>
>>>> the network, reconstruct its 100MB chunk and write it onto the brick. Is<br>
>>>> this right?<br>
>>>><br>
>>>> You are correct about the read/write.<br>
>>>> The only point is that the SHD daemon on one of the good bricks will pick<br>
>>>> up the index entry and heal it.<br>
>>>> The SHD daemon scans the .glusterfs/index directory and heals the entries.<br>
>>>> If the brick went down while IO was going on, the index will be present on<br>
>>>> the killed brick as well.<br>
>>>> However, if a brick was down and then you started writing to a file, the<br>
>>>> index entry would not be present on the killed brick.<br>
>>>> So even after that brick comes back up, the shd on that brick will not be<br>
>>>> able to find this index. However, the other bricks will have the entries<br>
>>>> and the shds on those bricks will heal it.<br>
>>>><br>
>>>> Note: I am assuming each brick is on a different node.<br>
>>>><br>
>>>> Ashish<br>
>>>><br>
>>><br>
>><br>
><br>
______________________________<wbr>_________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a><br>
<a href="http://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">http://lists.gluster.org/<wbr>mailman/listinfo/gluster-users</a></div></div></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature"><div dir="ltr">Pranith<br></div></div>
</div></div>