[Gluster-devel] readdir() scalability (was Re: [RFC ] dictionary optimizations)
xhernandez at datalab.es
Thu Sep 12 10:08:31 UTC 2013
Al 09/09/13 17:25, En/na Vijay Bellur ha escrit:
> On 09/09/2013 02:18 PM, Xavier Hernandez wrote:
>> Al 06/09/13 20:43, En/na Anand Avati ha escrit:
>>> On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez
>>> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>> Al 04/09/13 18:10, En/na Anand Avati ha escrit:
>>>> On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez
>>>> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>>> Al 04/09/13 14:05, En/na Jeff Darcy ha escrit:
>>>> On 09/04/2013 04:27 AM, Xavier Hernandez wrote:
>>>> I would also like to note that each node can store
>>>> multiple elements.
>>>> Current implementation creates a node for each byte
>>>> in the key. In my
>>>> implementation I only create a node if there is a
>>>> prefix coincidence between
>>>> 2 or more keys. This reduces the number of nodes and
>>>> the number of
>>>> Whatever we do, we should try to make sure that the
>>>> changes are profiled
>>>> against real usage. When I was making my own dict
>>>> optimizations back in March
>>>> of last year, I started by looking at how they're
>>>> actually used. At that time,
>>>> a significant majority of dictionaries contained just one
>>>> item. That's why I
>>>> only implemented a simple mechanism to pre-allocate the
>>>> first data_pair instead
>>>> of doing something more ambitious. Even then, the
>>>> difference in actual
>>>> performance or CPU usage was barely measurable. Dict
>>>> usage has certainly
>>>> changed since then, but I think you'd still be hard
>>>> pressed to find a case
>>>> where a single dict contains more than a handful of
>>>> entries, and approaches
>>>> that are optimized for dozens to hundreds might well
>>>> perform worse than simple
>>>> ones (e.g. because of cache aliasing or branch
>>>> If you're looking for other optimization opportunities
>>>> that might provide even
>>>> bigger "bang for the buck" then I suggest that
>>>> stack-frame or frame->local
>>>> allocations are a good place to start. Or string copying
>>>> in places like
>>>> loc_copy. Or the entire fd_ctx/inode_ctx subsystem. Let
>>>> me know and I'll come
>>>> up with a few more. To put a bit of a positive spin on
>>>> things, the GlusterFS
>>>> code offers many opportunities for improvement in terms
>>>> of CPU and memory
>>>> efficiency (though it's surprisingly still way better
>>>> than Ceph in that regard).
>>>> Yes. The optimizations on dictionary structures are not a big
>>>> improvement in the overall performance of GlusterFS. I tried
>>>> it on a real situation and the benefit was only marginal.
>>>> However I didn't test new features like an atomic lookup and
>>>> remove if found (because I would have had to review all the
>>>> code). I think this kind of functionalities could improve a
>>>> bit more the results I obtained.
>>>> However this is not the only reason to do these changes.
>>>> While I've been writing code I've found that it's tedious to
>>>> do some things just because there isn't such functions in
>>>> dict_t. Some actions require multiple calls, having to check
>>>> multiple errors and adding complexity and limiting
>>>> readability of the code. Many of these situations could be
>>>> solved using functions similar to what I proposed.
>>>> On the other side, if dict_t must be truly considered a
>>>> concurrent structure, there are a lot of race conditions that
>>>> might appear when doing some operations. It would require a
>>>> great effort to take care of all these possibilities
>>>> everywhere. It would be better to pack most of these
>>>> situations into functions inside the dict_t itself where it
>>>> is easier to combine some operations.
>>>> By the way, I've made some tests with multiple bricks and it
>>>> seems that there is a clear speed loss on directory listings
>>>> as the number of bricks increases. Since bricks should be
>>>> independent and they can work in parallel, I didn't expected
>>>> such a big performance degradation.
>>>> The likely reason is that, even though bricks are parallel for
>>>> IO, readdir is essentially a sequential operation and DHT has a
>>>> limitation that a readdir reply batch does not cross server
>>>> boundaries. So if you have 10 files and 1 server, all 10 entries
>>>> are returned in one call to the app/libc. If you have 10 files
>>>> and 10 servers evenly distributed, the app/libc has to perform 10
>>>> calls and keeps getting one file at a time. This problem goes
>>>> away when each server has enough files to fill up a readdir
>>>> batch. It's only when you have too few files and too many servers
>>>> that this "dilution" problem shows up. However, this is just a
>>>> theory and your problem may be something else too..
>>> I didn't know that DHT was doing a sequential brick scan on
>>> readdir(p) (my fault). Why is that ? Why it cannot return entries
>>> crossing a server boundary ? is it due to a technical reason or is
>>> it only due to the current implementation ?
>>> I've made a test using only directories (50 directories with 50
>>> subdirectories each). I started with one brick and I measured the
>>> time to do a recursive 'ls'. Then I sequentially added an
>>> additional brick, up to 6 (all of them physically independent),
>>> and repeated the ls. The time increases linearly as the number of
>>> bricks augments. As more bricks were added, the rebalancing time
>>> was also growing linearly.
>>> I think this is a big problem for scalability. It can be partially
>>> hidden by using some caching or preloading mechanisms, but it will
>>> be there and it will hit sooner or later.
>>>> Note that Brian Foster's readdir-ahead patch should address this
>>>> problem to a large extent. When loaded on top of DHT, the
>>>> prefiller effectively collapses the smaller chunks returned by
>>>> DHT into a larger chunk requested by the app/libc.
>>> I've seen it, however I think it will only partially mitigate and
>>> hide an existing problem. Imagine you have some hundreds or a
>>> thousand of bricks. I doubt readdir-ahead or anything else can
>>> hide the enormous latency that the sequential DHT scan will
>>> generate in that case.
>>> The main problem I see is that the full directory structure is
>>> read many times sequentially. I think it would be better to do the
>>> readdir(p) calls in parallel and combine them (possibly in
>>> background). This way the time to scan the directory structure
>>> would be almost constant, independently of the number of bricks.
>>> The design of the directory entries in DHT makes this essentially a
>>> sequential operation because entries from servers are appended, not
>>> striped. What I mean is, the logical ordering of
>>> All entries in a directory = All files and dirs in 0th server + All
>>> files (no dirs) in 1st server + All files (no dirs) in 2nd server + ..
>>> + All files (no dirs) in N'th server.
>>> in a sequential manner. If we read the entries of 2nd server along
>>> with entries of 1st server, we cannot "use" it till we finish reading
>>> all entries of 1st server and get EOD from it - which is why
>>> readdir-ahead is a more natural solution than reading in parallel for
>>> the above design.
>> As I understand it, what the read-ahead translator does is to collect
>> one or more answers from the DHT translator and combine them to return a
>> single answer as big as possible. If that is correct, it will certainly
>> reduce the number of readdir calls from application, however I think it
>> will still have a considerable latency when used on big clusters. Anyway
>> I don't have any measurement or valid argument to support this, so lets
>> see how readdir-ahead works in real environments before discussing
>> about it.
>>> Also, this is a problem only if each server has fewer entries than
>>> what can be returned in a single readdir() request by the application.
>>> As long as the server has more than this "minimum threshold" of number
>>> of files, the number of batched readdir() made by the client is going
>>> to be fixed, and those various requests will be spread across various
>>> servers (as opposed to, sending them all to the same server).
>> I've seen customers with large amounts of empty, or almost empty,
>> directories. Don't ask me why, I don't understand it either...
>>> So yes, as you add servers for a given small set of files the
>>> scalability drops, but that is only till you create more files, when
>>> the # of servers stop mattering again.
>>> Can you share the actual numbers from the tests you ran?
>> I've made the tests in 6 physical servers (Quad Atom D525 1.8 GHz. These
>> are the only servers I can use regularly to do tests) connected through
>> a dedicated 1 Gbit switch. Bricks are stored in 1TB SATA disks with ZFS.
>> One of the servers was also used as a client to do the tests.
>> Initially I created a volume with a single brick. I initialized the
>> volume with 50 directories with 50 subdirectories each (a total of 2500
>> directories). No files.
> Have you tried turning on "cluster.readdir-optimize"? This could help
> improve readdir performance for the directory hierarchy that you
I repeated the tests with this option enabled and it really improved
readdir performance, however it still shows a linear speed loss as the
number of bricks increases. Will the readdir-ahead translator be able to
hide this linear effect when the number of bricks is very high ?
Results of the tests with cluser.readdir-optimize active:
1 brick: 11.8 seconds
2 bricks: 15.4 seconds
3 bricks: 17.9 seconds
4 bricks: 20.6 seconds
5 bricks: 22.9 seconds
6 bricks: 25.4 seconds
12 bricks: 41.8 seconds
Rebalance also improved:
From 1 to 2 bricks: 77 seconds
From 2 to 3 bricks: 78 seconds
From 3 to 4 bricks: 81 seconds
From 4 to 5 bricks: 84 seconds
From 5 to 6 bricks: 87 seconds
From 6 to 12 bricks: 144 seconds
>> After each test, I added a new brick and started a rebalance. Once the
>> rebalance was completed, I umounted and stopped the volume and restarted
>> it again.
>> The test consisted of 4 'time ls -lR /<testdir> | wc -l'. The first
>> result was discarded. The result shown below is the mean of the other 3
>> 1 brick: 11.8 seconds
>> 2 bricks: 19.0 seconds
>> 3 bricks: 23.8 seconds
>> 4 bricks: 29.8 seconds
>> 5 bricks: 34.6 seconds
>> 6 bricks: 41.0 seconds
>> 12 bricks (2 bricks on each server): 78.5 seconds
>> The rebalancing time also grew considerably (these times are the result
>> of a single rebalance. They might not be very accurate):
>> From 1 to 2 bricks: 91 seconds
>> From 2 to 3 bricks: 102 seconds
>> From 3 to 4 bricks: 119 seconds
>> From 4 to 5 bricks: 138 seconds
>> From 5 to 6 bricks: 151 seconds
>> From 6 to 12 bricks: 259 seconds
>> The number of disk IOPS didn't exceed 40 in any server in any case. The
>> network bandwidth didn't go beyond 6 Mbits/s between any pair of servers
>> and none of them reached 100% core usage.
>>> Gluster-devel mailing list
>>> Gluster-devel at nongnu.org
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
More information about the Gluster-devel