[Gluster-devel] readdir() scalability (was Re: [RFC ] dictionary optimizations)

Xavier Hernandez xhernandez at datalab.es
Mon Sep 9 08:48:21 UTC 2013

Al 06/09/13 20:43, En/na Anand Avati ha escrit:
> On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez 
> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>     Al 04/09/13 18:10, En/na Anand Avati ha escrit:
>>     On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez
>>     <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>         Al 04/09/13 14:05, En/na Jeff Darcy ha escrit:
>>             On 09/04/2013 04:27 AM, Xavier Hernandez wrote:
>>                 I would also like to note that each node can store
>>                 multiple elements.
>>                 Current implementation creates a node for each byte
>>                 in the key. In my
>>                 implementation I only create a node if there is a
>>                 prefix coincidence between
>>                 2 or more keys. This reduces the number of nodes and
>>                 the number of
>>                 indirections.
>>             Whatever we do, we should try to make sure that the
>>             changes are profiled
>>             against real usage.  When I was making my own dict
>>             optimizations back in March
>>             of last year, I started by looking at how they're
>>             actually used. At that time,
>>             a significant majority of dictionaries contained just one
>>             item. That's why I
>>             only implemented a simple mechanism to pre-allocate the
>>             first data_pair instead
>>             of doing something more ambitious.  Even then, the
>>             difference in actual
>>             performance or CPU usage was barely measurable.  Dict
>>             usage has certainly
>>             changed since then, but I think you'd still be hard
>>             pressed to find a case
>>             where a single dict contains more than a handful of
>>             entries, and approaches
>>             that are optimized for dozens to hundreds might well
>>             perform worse than simple
>>             ones (e.g. because of cache aliasing or branch
>>             misprediction).
>>             If you're looking for other optimization opportunities
>>             that might provide even
>>             bigger "bang for the buck" then I suggest that
>>             stack-frame or frame->local
>>             allocations are a good place to start.  Or string copying
>>             in places like
>>             loc_copy.  Or the entire fd_ctx/inode_ctx subsystem.  Let
>>             me know and I'll come
>>             up with a few more.  To put a bit of a positive spin on
>>             things, the GlusterFS
>>             code offers many opportunities for improvement in terms
>>             of CPU and memory
>>             efficiency (though it's surprisingly still way better
>>             than Ceph in that regard).
>>         Yes. The optimizations on dictionary structures are not a big
>>         improvement in the overall performance of GlusterFS. I tried
>>         it on a real situation and the benefit was only marginal.
>>         However I didn't test new features like an atomic lookup and
>>         remove if found (because I would have had to review all the
>>         code). I think this kind of functionalities could improve a
>>         bit more the results I obtained.
>>         However this is not the only reason to do these changes.
>>         While I've been writing code I've found that it's tedious to
>>         do some things just because there isn't such functions in
>>         dict_t. Some actions require multiple calls, having to check
>>         multiple errors and adding complexity and limiting
>>         readability of the code. Many of these situations could be
>>         solved using functions similar to what I proposed.
>>         On the other side, if dict_t must be truly considered a
>>         concurrent structure, there are a lot of race conditions that
>>         might appear when doing some operations. It would require a
>>         great effort to take care of all these possibilities
>>         everywhere. It would be better to pack most of these
>>         situations into functions inside the dict_t itself where it
>>         is easier to combine some operations.
>>         By the way, I've made some tests with multiple bricks and it
>>         seems that there is a clear speed loss on directory listings
>>         as the number of bricks increases. Since bricks should be
>>         independent and they can work in parallel, I didn't expected
>>         such a big performance degradation.
>>     The likely reason is that, even though bricks are parallel for
>>     IO, readdir is essentially a sequential operation and DHT has a
>>     limitation that a readdir reply batch does not cross server
>>     boundaries. So if you have 10 files and 1 server, all 10 entries
>>     are returned in one call to the app/libc. If you have 10 files
>>     and 10 servers evenly distributed, the app/libc has to perform 10
>>     calls and keeps getting one file at a time. This problem goes
>>     away when each server has enough files to fill up a readdir
>>     batch. It's only when you have too few files and too many servers
>>     that this "dilution" problem shows up. However, this is just a
>>     theory and your problem may be something else too..
>     I didn't know that DHT was doing a sequential brick scan on
>     readdir(p) (my fault). Why is that ? Why it cannot return entries
>     crossing a server boundary ? is it due to a technical reason or is
>     it only due to the current implementation ?
>     I've made a test using only directories (50 directories with 50
>     subdirectories each). I started with one brick and I measured the
>     time to do a recursive 'ls'. Then I sequentially added an
>     additional brick, up to 6 (all of them physically independent),
>     and repeated the ls. The time increases linearly as the number of
>     bricks augments. As more bricks were added, the rebalancing time
>     was also growing linearly.
>     I think this is a big problem for scalability. It can be partially
>     hidden by using some caching or preloading mechanisms, but it will
>     be there and it will hit sooner or later.
>>     Note that Brian Foster's readdir-ahead patch should address this
>>     problem to a large extent. When loaded on top of DHT, the
>>     prefiller effectively collapses the smaller chunks returned by
>>     DHT into a larger chunk requested by the app/libc.
>     I've seen it, however I think it will only partially mitigate and
>     hide an existing problem. Imagine you have some hundreds or a
>     thousand of bricks. I doubt readdir-ahead or anything else can
>     hide the enormous latency that the sequential DHT scan will
>     generate in that case.
>     The main problem I see is that the full directory structure is
>     read many times sequentially. I think it would be better to do the
>     readdir(p) calls in parallel and combine them (possibly in
>     background). This way the time to scan the directory structure
>     would be almost constant, independently of the number of bricks.
> The design of the directory entries in DHT makes this essentially a 
> sequential operation because entries from servers are appended, not 
> striped. What I mean is, the logical ordering of
> All entries in a directory = All files and dirs in 0th server + All 
> files (no dirs) in 1st server + All files (no dirs) in 2nd server + .. 
> + All files (no dirs) in N'th server.
> in a sequential manner. If we read the entries of 2nd server along 
> with entries of 1st server, we cannot "use" it till we finish reading 
> all entries of 1st server and get EOD from it - which is why 
> readdir-ahead is a more natural solution than reading in parallel for 
> the above design.
As I understand it, what the read-ahead translator does is to collect 
one or more answers from the DHT translator and combine them to return a 
single answer as big as possible. If that is correct, it will certainly 
reduce the number of readdir calls from application, however I think it 
will still have a considerable latency when used on big clusters. Anyway 
I don't have any measurement or valid argument to support this, so lets 
see how readdir-ahead works in real environments before discussing about it.

> Also, this is a problem only if each server has fewer entries than 
> what can be returned in a single readdir() request by the application. 
> As long as the server has more than this "minimum threshold" of number 
> of files, the number of batched readdir() made by the client is going 
> to be fixed, and those various requests will be spread across various 
> servers (as opposed to, sending them all to the same server).
I've seen customers with large amounts of empty, or almost empty, 
directories. Don't ask me why, I don't understand it either...

> So yes, as you add servers for a given small set of files the 
> scalability drops, but that is only till you create more files, when 
> the # of servers stop mattering again.
> Can you share the actual numbers from the tests you ran?
I've made the tests in 6 physical servers (Quad Atom D525 1.8 GHz. These 
are the only servers I can use regularly to do tests) connected through 
a dedicated 1 Gbit switch. Bricks are stored in 1TB SATA disks with ZFS. 
One of the servers was also used as a client to do the tests.

Initially I created a volume with a single brick. I initialized the 
volume with 50 directories with 50 subdirectories each (a total of 2500 
directories). No files.

After each test, I added a new brick and started a rebalance. Once the 
rebalance was completed, I umounted and stopped the volume and restarted 
it again.

The test consisted of 4 'time ls -lR /<testdir> | wc -l'. The first 
result was discarded. The result shown below is the mean of the other 3 

1 brick: 11.8 seconds
2 bricks: 19.0 seconds
3 bricks: 23.8 seconds
4 bricks: 29.8 seconds
5 bricks: 34.6 seconds
6 bricks: 41.0 seconds
12 bricks (2 bricks on each server): 78.5 seconds

The rebalancing time also grew considerably (these times are the result 
of a single rebalance. They might not be very accurate):

 From 1 to 2 bricks: 91 seconds
 From 2 to 3 bricks: 102 seconds
 From 3 to 4 bricks: 119 seconds
 From 4 to 5 bricks: 138 seconds
 From 5 to 6 bricks: 151 seconds
 From 6 to 12 bricks: 259 seconds

The number of disk IOPS didn't exceed 40 in any server in any case. The 
network bandwidth didn't go beyond 6 Mbits/s between any pair of servers 
and none of them reached 100% core usage.


> Avati
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://supercolony.gluster.org/pipermail/gluster-devel/attachments/20130909/5881f2cf/attachment-0001.html>

More information about the Gluster-devel mailing list