[Gluster-devel] readdir() scalability (was Re: [RFC ] dictionary optimizations)

Fri Sep 6 18:43:37 UTC 2013

On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez <xhernandez at datalab.es>wrote:

>  Al 04/09/13 18:10, En/na Anand Avati ha escrit:
>
> On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez <xhernandez at datalab.es>wrote:
>
>> Al 04/09/13 14:05, En/na Jeff Darcy ha escrit:
>>
>>  On 09/04/2013 04:27 AM, Xavier Hernandez wrote:
>>>
>>>> I would also like to note that each node can store multiple elements.
>>>> Current implementation creates a node for each byte in the key. In my
>>>> implementation I only create a node if there is a prefix coincidence
>>>> between
>>>> 2 or more keys. This reduces the number of nodes and the number of
>>>> indirections.
>>>>
>>>
>>> Whatever we do, we should try to make sure that the changes are profiled
>>> against real usage.  When I was making my own dict optimizations back in
>>> March
>>> of last year, I started by looking at how they're actually used. At that
>>> time,
>>> a significant majority of dictionaries contained just one item. That's
>>> why I
>>> only implemented a simple mechanism to pre-allocate the first data_pair
>>> instead
>>> of doing something more ambitious.  Even then, the difference in actual
>>> performance or CPU usage was barely measurable.  Dict usage has certainly
>>> changed since then, but I think you'd still be hard pressed to find a
>>> case
>>> where a single dict contains more than a handful of entries, and
>>> approaches
>>> that are optimized for dozens to hundreds might well perform worse than
>>> simple
>>> ones (e.g. because of cache aliasing or branch misprediction).
>>>
>>> If you're looking for other optimization opportunities that might
>>> provide even
>>> bigger "bang for the buck" then I suggest that stack-frame or
>>> frame->local
>>> allocations are a good place to start.  Or string copying in places like
>>> loc_copy.  Or the entire fd_ctx/inode_ctx subsystem.  Let me know and
>>> I'll come
>>> up with a few more.  To put a bit of a positive spin on things, the
>>> GlusterFS
>>> code offers many opportunities for improvement in terms of CPU and memory
>>> efficiency (though it's surprisingly still way better than Ceph in that
>>> regard).
>>>
>>>   Yes. The optimizations on dictionary structures are not a big
>> improvement in the overall performance of GlusterFS. I tried it on a real
>> situation and the benefit was only marginal. However I didn't test new
>> features like an atomic lookup and remove if found (because I would have
>> had to review all the code). I think this kind of functionalities could
>> improve a bit more the results I obtained.
>>
>> However this is not the only reason to do these changes. While I've been
>> writing code I've found that it's tedious to do some things just because
>> there isn't such functions in dict_t. Some actions require multiple calls,
>> having to check multiple errors and adding complexity and limiting
>> readability of the code. Many of these situations could be solved using
>> functions similar to what I proposed.
>>
>> On the other side, if dict_t must be truly considered a concurrent
>> structure, there are a lot of race conditions that might appear when doing
>> some operations. It would require a great effort to take care of all these
>> possibilities everywhere. It would be better to pack most of these
>> situations into functions inside the dict_t itself where it is easier to
>> combine some operations.
>>
>> By the way, I've made some tests with multiple bricks and it seems that
>> there is a clear speed loss on directory listings as the number of bricks
>> increases. Since bricks should be independent and they can work in
>> parallel, I didn't expected such a big performance degradation.
>
>
>  The likely reason is that, even though bricks are parallel for IO,
> readdir is essentially a sequential operation and DHT has a limitation that
> a readdir reply batch does not cross server boundaries. So if you have 10
> files and 1 server, all 10 entries are returned in one call to the
> app/libc. If you have 10 files and 10 servers evenly distributed, the
> app/libc has to perform 10 calls and keeps getting one file at a time. This
> problem goes away when each server has enough files to fill up a readdir
> batch. It's only when you have too few files and too many servers that this
> "dilution" problem shows up. However, this is just a theory and your
> problem may be something else too..
>
>    I didn't know that DHT was doing a sequential brick scan on readdir(p)
> (my fault). Why is that ? Why it cannot return entries crossing a server
> boundary ? is it due to a technical reason or is it only due to the current
> implementation ?
>
> I've made a test using only directories (50 directories with 50
> subdirectories each). I started with one brick and I measured the time to
> do a recursive 'ls'. Then I sequentially added an additional brick, up to 6
> (all of them physically independent), and repeated the ls. The time
> increases linearly as the number of bricks augments. As more bricks were
> added, the rebalancing time was also growing linearly.
>
> I think this is a big problem for scalability. It can be partially hidden
> by using some caching or preloading mechanisms, but it will be there and it
> will hit sooner or later.
>
>
>   Note that Brian Foster's readdir-ahead patch should address this
> problem to a large extent. When loaded on top of DHT, the prefiller
> effectively collapses the smaller chunks returned by DHT into a larger
> chunk requested by the app/libc.
>
>    I've seen it, however I think it will only partially mitigate and hide
> an existing problem. Imagine you have some hundreds or a thousand of
> bricks. I doubt readdir-ahead or anything else can hide the enormous
> latency that the sequential DHT scan will generate in that case.
>
> The main problem I see is that the full directory structure is read many
> times sequentially. I think it would be better to do the readdir(p) calls
> in parallel and combine them (possibly in background). This way the time to
> scan the directory structure would be almost constant, independently of the
> number of bricks.
>

The design of the directory entries in DHT makes this essentially a
sequential operation because entries from servers are appended, not
striped. What I mean is, the logical ordering of

All entries in a directory = All files and dirs in 0th server + All files
(no dirs) in 1st server + All files (no dirs) in 2nd server + .. + All
files (no dirs) in N'th server.

in a sequential manner. If we read the entries of 2nd server along with
entries of 1st server, we cannot "use" it till we finish reading all
entries of 1st server and get EOD from it - which is why readdir-ahead is a
more natural solution than reading in parallel for the above design.

Also, this is a problem only if each server has fewer entries than what can
be returned in a single readdir() request by the application. As long as
the server has more than this "minimum threshold" of number of files, the
number of batched readdir() made by the client is going to be fixed, and
those various requests will be spread across various servers (as opposed
to, sending them all to the same server).

So yes, as you add servers for a given small set of files the scalability
drops, but that is only till you create more files, when the # of servers
stop mattering again.

Can you share the actual numbers from the tests you ran?

Avati
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20130906/5845c22e/attachment-0001.html>