[Gluster-devel] readdir() scalability (was Re: [RFC ] dictionary optimizations)
Xavier Hernandez
xhernandez at datalab.es
Fri Sep 13 08:05:46 UTC 2013
On 12/09/13 13:17, Brian Foster wrote:
> On 09/12/2013 06:08 AM, Xavier Hernandez wrote:
>> On 09/09/13 17:25, Vijay Bellur wrote:
>>> On 09/09/2013 02:18 PM, Xavier Hernandez wrote:
>>>> On 06/09/13 20:43, Anand Avati wrote:
>>>>> On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez
>>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>> On 04/09/13 18:10, Anand Avati wrote:
>>>>>> On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez
>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>
>>>>>> On 04/09/13 14:05, Jeff Darcy wrote:
>>>>>>
>>>>>> On 09/04/2013 04:27 AM, Xavier Hernandez wrote:
>>>>>>
> ...
>>> Have you tried turning on "cluster.readdir-optimize"? This could help
>>> improve readdir performance for the directory hierarchy that you
>>> describe.
>>>
>> I repeated the tests with this option enabled and it really improved
>> readdir performance; however, there is still a linear slowdown as the
>> number of bricks increases. Will the readdir-ahead translator be able to
>> hide this linear effect when the number of bricks is very high?
>>
> I don't know that it will change the overall effect, but perhaps it
> could smooth things out (or if not, we can see about further
> improvements). Could you try it out and let us know? :)
I've repeated the tests using the master branch (commit 643533c7), combining
cluster.readdir-optimize and performance.readdir-ahead. These are the
results:
Configurations
Test1: cluster.readdir-optimize=off and performance.readdir-ahead=off
Test2: cluster.readdir-optimize=on and performance.readdir-ahead=off
Test3: cluster.readdir-optimize=off and performance.readdir-ahead=on
Test4: cluster.readdir-optimize=on and performance.readdir-ahead=on
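For reference, switching between these four combinations is just a matter of
toggling the two volume options with the standard gluster CLI, along these
lines (<volname> is a placeholder for the volume name):

    gluster volume set <volname> cluster.readdir-optimize on
    gluster volume set <volname> performance.readdir-ahead on

(with 'off' for the combinations where an option is disabled).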
ls: average time, in seconds, of 3 runs of 'ls -lR <mount root> | wc -l'
    (a previous run is made first to warm the caches)
rb: rebalance time, in seconds (not averaged; measured only once)
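For clarity, each 'ls' value is obtained with a procedure along these lines
(<mount root> again being a placeholder for the actual mount point):

    # warm-up pass to fill the caches (not measured)
    ls -lR <mount root> > /dev/null
    # three measured passes; the 'ls' column is their average
    for i in 1 2 3; do time ls -lR <mount root> | wc -l; done

The 'rb' value is, roughly, the time from issuing 'gluster volume rebalance
<volname> start' until 'gluster volume rebalance <volname> status' reports
completion.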
Bricks |    Test1    |    Test2    |    Test3    |    Test4
       |   ls   rb   |   ls   rb   |   ls   rb   |   ls   rb
     1 | 10.7   --   | 10.6   --   |  9.8   --   |  9.8   --
     2 | 18.7   82   | 14.1   84   | 17.1   83   | 13.5   82
     3 | 24.6   83   | 16.8   84   | 23.1   84   | 16.4   85
     4 | 30.2   87   | 19.7   86   | 29.0   88   | 19.2   87
     5 | 36.0   92   | 22.5   90   | 34.8   91   | 21.7   91
     6 | 42.2   97   | 25.1   96   | 40.9   95   | 24.1   96
    12 | 80.4  161   | 42.1  160   | 81.3  162   | 41.5  162
It seems that the additional benefit from readdir-ahead is minimal for this
workload, which only traverses the directory structure.
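To put rough numbers on it, taking (time at 12 bricks - time at 1 brick) / 11
as an approximate per-brick cost from the table above:

    Test1: (80.4 - 10.7) / 11 ≈ 6.3 s/brick    Test3: (81.3 - 9.8) / 11 ≈ 6.5 s/brick
    Test2: (42.1 - 10.6) / 11 ≈ 2.9 s/brick    Test4: (41.5 - 9.8) / 11 ≈ 2.9 s/brick

readdir-optimize roughly halves the per-brick cost, but readdir-ahead leaves
the slope essentially unchanged in both cases.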
Xavi
> Brian
>
>> Results of the tests with cluster.readdir-optimize active:
>>
>> 1 brick: 11.8 seconds
>> 2 bricks: 15.4 seconds
>> 3 bricks: 17.9 seconds
>> 4 bricks: 20.6 seconds
>> 5 bricks: 22.9 seconds
>> 6 bricks: 25.4 seconds
>> 12 bricks: 41.8 seconds
>>
>> Rebalance also improved:
>>
>> From 1 to 2 bricks: 77 seconds
>> From 2 to 3 bricks: 78 seconds
>> From 3 to 4 bricks: 81 seconds
>> From 4 to 5 bricks: 84 seconds
>> From 5 to 6 bricks: 87 seconds
>> From 6 to 12 bricks: 144 seconds
>>
>> Xavi
>>
>>> -Vijay
>>>
>>>
>>>> After each test, I added a new brick and started a rebalance. Once the
>>>> rebalance had completed, I unmounted and stopped the volume and then
>>>> restarted it.
>>>>
>>>> The test consisted of 4 runs of 'time ls -lR /<testdir> | wc -l'. The
>>>> first result was discarded; the values shown below are the mean of the
>>>> remaining 3 runs.
>>>>
>>>> 1 brick: 11.8 seconds
>>>> 2 bricks: 19.0 seconds
>>>> 3 bricks: 23.8 seconds
>>>> 4 bricks: 29.8 seconds
>>>> 5 bricks: 34.6 seconds
>>>> 6 bricks: 41.0 seconds
>>>> 12 bricks (2 bricks on each server): 78.5 seconds
>>>>
>>>> The rebalance time also grew considerably (each of these times comes
>>>> from a single rebalance, so they might not be very accurate):
>>>>
>>>> From 1 to 2 bricks: 91 seconds
>>>> From 2 to 3 bricks: 102 seconds
>>>> From 3 to 4 bricks: 119 seconds
>>>> From 4 to 5 bricks: 138 seconds
>>>> From 5 to 6 bricks: 151 seconds
>>>> From 6 to 12 bricks: 259 seconds
>>>>
>>>> Disk IOPS never exceeded 40 on any server in any of the tests. Network
>>>> bandwidth didn't go beyond 6 Mbit/s between any pair of servers, and
>>>> none of the servers reached 100% usage on any core.
>>>>
>>>> Xavi
>>>>
>>>>> Avati