[Gluster-users] disperse volume brick counts limits in RHES

Xavier Hernandez xhernandez at datalab.es
Fri May 5 21:28:53 UTC 2017


Hi,

I'm not sure about the exact reason for the slow self-heal, but we can try to see what's taking the time. I would consider 3 factors:

1. Erasure code algorithm
2. Small I/O blocks
3. Latencies

Some testing on 1) shows that a single core of an Intel Xeon E5-2630L at 2 GHz can encode and decode at ~400 MB/s in a 16+4 configuration (using the latest version without any CPU extensions; with SSE it's ~3 times faster). When self-heal is involved, to heal a single brick we need to read data from the healthy bricks and decode it. If we suppose that we are healing a file of 2GB, since we can decode at ~400 MB/s, it will take ~5 seconds to recover the original data. Then we re-encode it to generate the missing fragment. Since we only need to generate a single fragment, the encoding is much faster: normally it would take 5 seconds to encode the whole file, but in this case it takes ~1/20 of that time, so ~250ms. So we need ~5.25 seconds to heal the file. The total data written to the damaged brick will be 1/16 of the total size, so 128 MB. This gives an effective speed of ~24 MB/s on the brick being healed.
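In rough Python, the arithmetic looks like this (the figures are the illustrative estimates above, not measurements on any particular hardware):

    # Back-of-envelope heal of one 2GB file on a 16+4 configuration,
    # using the illustrative figures above (not measured values).
    file_size_mb = 2048          # the 2GB example file
    decode_mb_s = 400            # assumed single-core decode speed, no CPU extensions
    data_bricks = 16

    decode_time = file_size_mb / decode_mb_s    # ~5.1 s to rebuild the original data
    encode_time = decode_time / 20              # ~0.26 s, only one fragment is generated
    coding_time = decode_time + encode_time     # ~5.4 s of pure computation
    healed_mb = file_size_mb / data_bricks      # 128 MB written to the damaged brick
    print(healed_mb / coding_time)              # ~24 MB/s on the brick being healed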

By default the healing process uses blocks of 128KB. In a 16+4 configuration this means that for each 128KB block of the original file, 16 fragments of 8KB are read from the 16 healthy bricks and a single fragment of 8KB is written to the damaged brick. So we need to write 128 MB in blocks of 8 KB, which means ~16000 I/O operations on each brick to heal the file. Due to the serialization of reads and writes, it's less likely that the target brick can merge write requests to improve performance, though some are indeed merged. Exact numbers here depend on a lot of factors, but just as a reference, if we suppose that we can write at 24 MB/s and no merges are done, that would mean ~3000 IOPS, which is already quite a lot.
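The same block-count arithmetic, continuing the Python sketch above:

    # Write operations on the healed brick when healing in 128KB blocks.
    heal_block_kb = 128
    fragment_kb = heal_block_kb / 16              # 8 KB fragment per brick per block
    healed_mb = 128                               # data written to the damaged brick
    write_ops = healed_mb * 1024 / fragment_kb    # ~16000 writes of 8 KB for 128 MB
    iops_needed = 24 * 1024 / fragment_kb         # ~3000 IOPS to sustain ~24 MB/s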

The heal process involves multiple operations that need to be serialized. This means that multiple network latencies add up to heal a single block of the damaged file. Basically, we read a block from the healthy bricks and then it's written to the damaged brick. An 8 KB packet has a latency of ~0.15 ms on 10 Gbps ethernet. Reads from disk can be on the order of a few ms, though disk read-ahead can reduce the amortized read time to less than a millisecond (say ~0.1 ms). Writes can take longer, but let's suppose they are cached and immediately acknowledged. In this case we need 0.15 ms * 2 (read and write network latency) + 0.1 ms (amortized disk read latency) = ~0.4 ms per block. ~16000 I/O requests * 0.4 ms = ~6.5 seconds.
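And the latency side, with the same assumed figures (0.15 ms per network round trip for an 8 KB request on 10 GbE, 0.1 ms amortized disk read, cached writes):

    # Serialized latency per 8KB heal block and the total it adds for the file.
    net_ms = 0.15                              # one network round trip for an 8 KB request
    disk_read_ms = 0.10                        # amortized thanks to read-ahead
    per_block_ms = 2 * net_ms + disk_read_ms   # read trip + write trip + disk = ~0.4 ms
    blocks = 128 * 1024 // 8                   # 16384 fragments of 8 KB in 128 MB
    latency_s = blocks * per_block_ms / 1000   # ~6.5 s of accumulated latency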

If we add the decoding/encoding time to the accumulated latency, we get ~11.5 seconds. 128 MB / 11.5 s = ~11 MB/s.
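Putting the two parts together reproduces the figures above (still only a back-of-envelope estimate):

    # Total heal time = coding time + accumulated latency (round numbers from above).
    heal_time_s = 5.25 + 6.5            # ~11.75 s, ~11.5 s in round numbers
    print(128 / heal_time_s)            # ~11 MB/s written to the healed brick
    print(2048 / heal_time_s)           # ~175 MB/s of original file data processed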

Note that these numbers are not taken from real tests. They are only reasonable theoretical numbers, but the result seems quite similar to what you are seeing. Also note that even if the write speed on the damaged brick is only ~11 MB/s, the amount of original file data processed is ~176 MB/s (writing 128 MB to the damaged brick means that the whole 2GB file has been processed in 11.5 seconds).

To improve these numbers you need to encode/decode faster (using more cores) and reduce the number of I/O operations needed. The first can be done with parallel self-heal; using CPU extensions like SSE or AVX can also improve performance by 3x to 6x. The second is being worked on, to allow self-heal to be configured to use bigger blocks, reducing the number of network round trips and thus the total latency.
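To get a feel for how those two changes move the estimate, here is the same model with the heal block size and a heal-thread count as parameters. Both are just inputs to this sketch, not names of existing options, and it ignores the larger transfer time of bigger blocks and any CPU or disk saturation, so treat the outputs as rough upper bounds:

    # Rough model of how bigger heal blocks and parallel heals move the estimate.
    def heal_speed_mb_s(heal_block_kb=128, parallel_heals=1, coding_s=5.25):
        fragment_kb = heal_block_kb / 16
        blocks = 128 * 1024 / fragment_kb          # writes needed for 128 MB
        latency_s = blocks * 0.4 / 1000            # ~0.4 ms per serialized block
        return parallel_heals * 128 / (coding_s + latency_s)

    print(heal_speed_mb_s())                       # ~11 MB/s, the baseline above
    print(heal_speed_mb_s(heal_block_kb=1024))     # fewer round trips: ~21 MB/s
    print(heal_speed_mb_s(parallel_heals=4))       # four files healing at once: ~43 MB/s aggregate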

I hope this gives a general overview of what's going on with self-heal.

Xavi



On Friday, May 05, 2017 15:54 CEST, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
Wondering if Xavi knows something.

On Fri, May 5, 2017 at 7:24 PM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:

On Fri, May 5, 2017 at 7:21 PM, Serkan Çoban <cobanserkan at gmail.com> wrote:

In our use case every node has 26 bricks. I am using 60 nodes, one 9PB
volume with 16+4 EC configuration, each brick in a sub-volume is on a
different host.
We put 15-20k 2GB files every day into 10-15 folders. So it is 1500K
files/folder. Our gluster version is 3.7.11.
Heal speed in this environment is 8-10MB/sec/brick.

I did some tests of the parallel self-heal feature with version 3.9, two
servers with 26 bricks each, in 8+2 and 16+4 EC configurations.
This was a small test environment and, as I said, the results show 8+2 is
2x faster than 16+4 with parallel self-heal threads set to 2/4.
In 1-2 months our new servers are arriving; I will do detailed tests of
heal performance for 8+2 and 16+4 and inform you of the results.

In that case I still don't know why this is the case. Thanks for the inputs. I will also try to find out how long a 2GB file takes in 8+2 vs 16+4 and see if there is something I need to look at closely.

On Fri, May 5, 2017 at 2:54 PM, Pranith Kumar Karampuri<pkarampu at redhat.com> wrote:
>
>
> On Fri, May 5, 2017 at 5:19 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>>
>>
>>
>> On Fri, May 5, 2017 at 2:38 PM, Serkan Çoban <cobanserkan at gmail.com>
>> wrote:
>>>
>>> It is the overall time; an 8TB data disk healed 2x faster in the 8+2
>>> configuration.
>>
>>
>> Wow, that is counter-intuitive to me. I will need to explore this
>> to find out why that could be. Thanks a lot for this feedback!
>
>
> From memory, I remember you said you have a lot of small files hosted on the
> volume, right? It could be because of the bug that
> https://review.gluster.org/17151 is fixing. That is the only reason I can
> guess right now. We will try to test this kind of case if you could give us
> a bit more detail about average file size, depth of directories, etc. to
> simulate a similar-looking directory structure.
>
>>
>>
>>>
>>>
>>> On Fri, May 5, 2017 at 10:00 AM, Pranith Kumar Karampuri
>>> <pkarampu at redhat.com> wrote:
>>> >
>>> >
>>> > On Fri, May 5, 2017 at 11:42 AM, Serkan Çoban <cobanserkan at gmail.com>
>>> > wrote:
>>> >>
>>> >> Healing gets slower as you increase m in an m+n configuration.
>>> >> We are using a 16+4 configuration without any problems other than heal
>>> >> speed.
>>> >> I tested heal speed with 8+2 and 16+4 on 3.9.0 and saw that healing on
>>> >> 8+2 is faster by 2x.
>>> >
>>> >
>>> > As you increase the number of nodes participating in an EC set, the
>>> > number of parallel heals increases. Is the heal speed improvement you
>>> > saw per file, or in the overall time it took to heal the data?
>>> >
>>> >>
>>> >>
>>> >>
>>> >> On Fri, May 5, 2017 at 9:04 AM, Ashish Pandey <aspandey at redhat.com>
>>> >> wrote:
>>> >> >
>>> >> > 8+2 and 8+3 configurations are not a limitation, just
>>> >> > suggestions.
>>> >> > You can create a 16+3 volume without any issue.
>>> >> >
>>> >> > Ashish
>>> >> >
>>> >> > ________________________________
>>> >> > From: "Alastair Neil" <ajneil.tech at gmail.com>
>>> >> > To: "gluster-users" <gluster-users at gluster.org>
>>> >> > Sent: Friday, May 5, 2017 2:23:32 AM
>>> >> > Subject: [Gluster-users] disperse volume brick counts limits in RHES
>>> >> >
>>> >> >
>>> >> > Hi
>>> >> >
>>> >> > we are deploying a large (24 node/45 brick) cluster and noted that
>>> >> > the RHES guidelines limit the number of data bricks in a disperse
>>> >> > set to 8. Is there any reason for this? I am aware that you want
>>> >> > this to be a power of 2, but as we have a large number of nodes we
>>> >> > were planning on going with 16+3. Dropping to 8+2 or 8+3 will be a
>>> >> > real waste for us.
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> >
>>> >> > Alastair
>>> >> >
>>> >> >
>>> >> > _______________________________________________
>>> >> > Gluster-users mailing list
>>> >> > Gluster-users at gluster.org
>>> >> > http://lists.gluster.org/mailman/listinfo/gluster-users
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Pranith
>>
>>
>>
>>
>> --
>> Pranith
>
>
>
>
> --
> Pranith


-- Pranith

