[Gluster-users] disperse volume brick counts limits in RHES

Xavier Hernandez xhernandez at datalab.es
Tue May 9 06:53:19 UTC 2017


Hi Alastair,

the numbers I'm giving correspond to an Intel Xeon E5-2630L 2 GHz CPU.

On 08/05/17 22:44, Alastair Neil wrote:
> so the bottleneck is that computations with 16x20 matrix require  ~4
> times the cycles?

This is only part of the problem. A 16x16 matrix can be processed at a 
rate of about 400 MB/s, so a single fragment on a brick should be 
processed at 400/16 = 25 MB/s, which is not what we actually observe.

Note that the fragment on a brick is only part of a whole file, so 25 
MB/s on a brick means that the real file is being processed at 400 MB/s.
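
To put some numbers on it (a back-of-the-envelope sketch in Python, not 
GlusterFS code; the 400 MB/s figure is the one given above):

    # Each brick stores only 1/k of the file's data in a k+r disperse set,
    # so the per-brick fragment rate is the whole-file decode rate / k.
    def per_brick_rate(file_rate_mb_s, k):
        return file_rate_mb_s / k

    print(per_brick_rate(400.0, 16))   # -> 25.0 MB/s per brick for 16+4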

> It seems then that there is ample room for
> improvement, as there are many linear algebra packages out there that
> scale better than O(nxm).

That's true for much bigger matrices, where the synchronization time 
between threads is negligible compared to the computation time. In this 
case the algorithm is already highly optimized, and any attempt to 
distribute the computation of a single matrix would only make it slower.
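
As a rough illustration of why splitting a single decode across threads 
doesn't pay off (my own estimates in Python; the 512-byte fragment size 
and the synchronization cost are assumptions, not measured values):

    CLOCK_HZ = 2e9                 # the 2 GHz CPU mentioned above
    CYCLES_PER_BYTE = 5            # ~5 cycles/byte for 16x16 (see below)
    STRIPE_BYTES = 16 * 512        # one stripe, assuming 512-byte fragments

    work_us = STRIPE_BYTES * CYCLES_PER_BYTE / CLOCK_HZ * 1e6
    sync_us = 10.0                 # assumed cost of handing work to another thread
    print(work_us, sync_us)        # ~20 us of work vs ~10 us of handoff overhead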

Note that the current algorithm can rebuild the original data at a rate 
of ~5 CPU cycles per byte in a 16x16 configuration without any SIMD 
extensions. With SSE or AVX this drops to close to 1 cycle per byte.
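
In other words (simple arithmetic, assuming a single core at the 2 GHz 
mentioned above):

    def decode_rate_mb_s(clock_hz, cycles_per_byte):
        return clock_hz / cycles_per_byte / 1e6

    print(decode_rate_mb_s(2e9, 5.0))   # ~400 MB/s without SIMD
    print(decode_rate_mb_s(2e9, 1.0))   # ~2000 MB/s with SSE/AVX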

In this case the best we can do is run more than one heal in parallel. 
That uses more than one core to compute the matrices and gives better 
overall performance.
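
A trivial sketch of the scaling we can expect (each heal is 
single-threaded, so concurrent heals simply spread over the available 
cores; if I remember correctly, the knob for this in recent releases is 
the disperse.shd-max-threads volume option):

    def aggregate_decode_rate(per_heal_mb_s, parallel_heals, cores):
        # Each heal keeps roughly one core busy; beyond the core count
        # there is nothing more to gain.
        return per_heal_mb_s * min(parallel_heals, cores)

    print(aggregate_decode_rate(400.0, 4, 8))   # ~1600 MB/s with 4 heals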

> Is the healing time dominated by the EC
> compute time?  If Serkan saw a hard 2x scaling then it seems likely.

Partially. The computation speed is doubled in an 8+2 configuration, but 
the number of IOPS is also halved, with each operation twice the size of 
a 16+4 operation. This means we only pay half of the latencies when using 
8+2, and the bandwidth is better utilized.
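
A toy model of those two effects (my own sketch, not how the self-heal 
daemon is implemented; the heal window, latency and cycle figures are 
assumptions chosen only to show the scaling):

    HEAL_WINDOW = 1024 * 1024       # file data handled per heal op (assumed)
    ROUND_TRIP  = 0.2e-3            # seconds of latency per op (assumed)
    CYCLES_PER_BYTE_PER_K = 5 / 16  # ~5 cycles/byte at k=16, as above
    CLOCK_HZ = 2e9

    def heal_time(k, brick_bytes):
        # Rebuilding a byte costs ~k multiply-adds (one decode-matrix row
        # dotted with the k surviving fragments).
        compute = brick_bytes * k * CYCLES_PER_BYTE_PER_K / CLOCK_HZ
        fragment = HEAL_WINDOW / k            # per-brick I/O size per op
        ops = brick_bytes / fragment          # ops to rebuild the brick
        latency = ops * ROUND_TRIP
        return compute + latency

    brick = 8e12                              # the 8 TB disk from the thread
    print(heal_time(16, brick) / heal_time(8, brick))   # ~2x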

The theoretical speed of matrix processing is 25 MB/s per brick, but the 
real speed seen is considerably smaller, so network latencies and other 
factors also contribute to the heal time.

Xavi

>
> -Alastair
>
>
>
>
> On 8 May 2017 at 03:02, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
>     On 05/05/17 13:49, Pranith Kumar Karampuri wrote:
>
>
>
>         On Fri, May 5, 2017 at 2:38 PM, Serkan Çoban
>         <cobanserkan at gmail.com> wrote:
>
>             It is the over all time, 8TB data disk healed 2x faster in 8+2
>             configuration.
>
>
>         Wow, that is counter intuitive for me. I will need to explore
>         about this
>         to find out why that could be. Thanks a lot for this feedback!
>
>
>     Matrix multiplication for encoding/decoding of 8+2 is 4 times faster
>     than 16+4 (one matrix of 16x16 is composed of 4 submatrices of 8x8);
>     however, each matrix operation on a 16+4 configuration takes twice
>     the amount of data of an 8+2, so the net effect is that 8+2 is twice
>     as fast as 16+4.
>
>     An 8+2 also uses bigger blocks on each brick, processing the same
>     amount of data in fewer I/O operations and bigger network packets.
>
>     Probably these are the reasons why 16+4 is slower than 8+2.
>
>     See my other email for more detailed description.
>
>     Xavi
>
>
>
>
>             On Fri, May 5, 2017 at 10:00 AM, Pranith Kumar Karampuri
>             <pkarampu at redhat.com> wrote:
>             >
>             >
>             > On Fri, May 5, 2017 at 11:42 AM, Serkan Çoban
>             > <cobanserkan at gmail.com> wrote:
>             >>
>             >> Healing gets slower as you increase m in m+n configuration.
>             >> We are using 16+4 configuration without any problems other
>             >> than heal speed.
>             >> I tested heal speed with 8+2 and 16+4 on 3.9.0 and see that
>             >> heals on 8+2 is faster by 2x.
>             >
>             >
>             > As you increase the number of nodes that are participating
>             > in an EC set, the number of parallel heals increases. Is the
>             > heal speed you saw improved per file, or the overall time it
>             > took to heal the data?
>             >
>             >>
>             >>
>             >>
>             >> On Fri, May 5, 2017 at 9:04 AM, Ashish Pandey
>             >> <aspandey at redhat.com> wrote:
>             >> >
>             >> > 8+2 and 8+3 configurations are not a limitation, just
>             >> > suggestions. You can create a 16+3 volume without any issue.
>             >> >
>             >> > Ashish
>             >> >
>             >> > ________________________________
>             >> > From: "Alastair Neil" <ajneil.tech at gmail.com>
>             >> > To: "gluster-users" <gluster-users at gluster.org>
>             >> > Sent: Friday, May 5, 2017 2:23:32 AM
>             >> > Subject: [Gluster-users] disperse volume brick counts limits in RHES
>             >> >
>             >> >
>             >> > Hi
>             >> >
>             >> > we are deploying a large (24 node/45 brick) cluster and
>             >> > noted that the RHES guidelines limit the number of data
>             >> > bricks in a disperse set to 8.  Is there any reason for
>             >> > this?  I am aware that you want this to be a power of 2,
>             >> > but as we have a large number of nodes we were planning on
>             >> > going with 16+3.
>             >> > Dropping to 8+2 or 8+3 will be a real waste for us.
>             >> >
>             >> > Thanks,
>             >> >
>             >> >
>             >> > Alastair
>             >> >
>             >> >
>             >
>             >
>             >
>             >
>             > --
>             > Pranith
>
>
>
>
>         --
>         Pranith
>
>
>
>
>
>


