[Bugs] [Bug 1415917] New: RFE: Make readdirp parallel in dht
bugzilla at redhat.com
bugzilla at redhat.com
Tue Jan 24 04:53:05 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1415917
Bug ID: 1415917
Summary: RFE: Make readdirp parallel in dht
Product: GlusterFS
Version: 3.10
Component: distribute
Assignee: bugs at gluster.org
Reporter: rgowdapp at redhat.com
CC: bugs at gluster.org, pgurusid at redhat.com
Depends On: 1401812
+++ This bug was initially created as a clone of Bug #1401812 +++
Description of problem:
Currently readdirp is sequential at the dht layer. This make find and recursive
listing of small directories very slow(directory whose content can be
accomodated in one readdirp call, eg: ~600 entries if buf size is 128k).
The number of readdirp fops
required to fetch the ls -l -R for nested directories is:
no. of fops = (x + 1) * m * n
n = number of bricks
m = number of directories
x = number of readdirp calls required to fetch the dentries completely
(this depends on the size of the directory and the readdirp buf size)
1 = readdirp fop that is sent to just detect the end of directory.
Eg: Let's say, to list 800 directories with files ~300 each and readdirp
buf size 128K, on distribute 6:
(1+1) * 800 * 6 = 9600 fops
And all the readdirp fops are not sent in parallel to all the bricks, but
sequentially. This patch is a first step towards making readdirp parallel
With parallel readdirp, the number of fops may not decrease drastically
but since they are issued in parallel, it will increase the throughput.
Why its not a straightforward problem to solve:
One needs to briefly understand, how the directory offset is handled in dht.
[1], [2], [3] are some of the links that will hint the same.
- The d_off is in the order of bricks identfied by dht. Hence, the dentries
should always be returned in the same order as bricks. i.e. brick2 entries
shouldn't be returned before brick1 reaches EOD.
- We cannot store any info of offset read so far etc. in inode_ctx or fd_ctx
- In case of a very large directories, and readdirp buf too small to hold
all the dentries in any brick, parallel readdirp is a overhead. Sequential
readdirp best suits the large directories. This demands dht be aware of or
speculate the directory size.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
--- Additional comment from Worker Ant on 2016-12-06 06:06:24 EST ---
REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection
logic) posted (#1) for review on master by Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-07 02:12:29 EST ---
REVIEW: http://review.gluster.org/16039 (dht: At places needed use
STACK_WIND_COOKIE) posted (#2) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-08 04:30:02 EST ---
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND
outside of mutex locks) posted (#1) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-08 05:40:27 EST ---
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise
option of dht) posted (#1) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-08 06:29:59 EST ---
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have
readdir-ahead as a child of dht) posted (#1) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-29 03:45:41 EST ---
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND
outside of mutex locks) posted (#2) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-29 03:46:49 EST ---
REVIEW: http://review.gluster.org/16042 (readdir-ahead: Enhance EOD detection
logic) posted (#2) for review on master by Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-29 03:47:05 EST ---
REVIEW: http://review.gluster.org/16039 (dht: At places needed use
STACK_WIND_COOKIE) posted (#3) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-31 06:14:19 EST ---
REVIEW: http://review.gluster.org/16311 (dht: reorg hash calculation, dht
server stub xlator will be using it, to filter unhashed directories in
readdirp) posted (#1) for review on master by Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-31 06:14:22 EST ---
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht
to filter the unhashed dir in readdirp) posted (#1) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-31 06:34:35 EST ---
REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to
libglusterfs) posted (#2) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-31 06:34:37 EST ---
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht
to filter the unhashed dir in readdirp) posted (#2) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2016-12-31 06:44:28 EST ---
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND
outside of mutex locks) posted (#3) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-01 00:10:44 EST ---
REVIEW: http://review.gluster.org/16311 (dht: move hash calculation code to
libglusterfs) posted (#3) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-01 00:10:46 EST ---
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht
to filter the unhashed dir in readdirp) posted (#3) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-01 03:00:04 EST ---
REVIEW: http://review.gluster.org/16312 (dht-server-stub: Add a server side dht
to filter the unhashed dir in readdirp) posted (#4) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-05 06:15:13 EST ---
REVIEW: http://review.gluster.org/16039 (dht: At places needed use
STACK_WIND_COOKIE) posted (#4) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-05 07:57:13 EST ---
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND
outside of mutex locks) posted (#4) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-05 23:27:55 EST ---
COMMIT: http://review.gluster.org/16042 committed in master by Raghavendra G
(rgowdapp at redhat.com)
------
commit f60631904defdaec2f1bae84b3cfd6a3e083cf09
Author: Poornima G <pgurusid at redhat.com>
Date: Tue Dec 6 16:31:51 2016 +0530
readdir-ahead: Enhance EOD detection logic
Issue:
Currently end of directory is identified on obtaining a
readdirp_cbk with op_ret = 0 (i.e. 0 entries fetched in
readdirp). Thus an extra readdirp is required for every
directory just to identify EOD. Consider a case of listing
large number of small directories. The readdirp fops required
are doubled in that case.
Solution:
On reaching the EOD, posix sets the op_errno to ENOENT,
hence along with looking for 'op_ret == 0' we also
look for 'operrno == ENOENT' ehile checking for EOD condition
Change-Id: I7a5b52e7b98f5dc236c387635fcc651dac0171b3
BUG: 1401812
Signed-off-by: Poornima G <pgurusid at redhat.com>
Reviewed-on: http://review.gluster.org/16042
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Smoke: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
--- Additional comment from Worker Ant on 2017-01-09 02:17:33 EST ---
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND
outside of mutex locks) posted (#5) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-09 02:24:06 EST ---
REVIEW: http://review.gluster.org/16039 (dht: At places needed use
STACK_WIND_COOKIE) posted (#5) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-09 03:31:59 EST ---
REVIEW: http://review.gluster.org/16068 (readdir-ahead : Perform STACK_UNWIND
outside of mutex locks) posted (#6) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-09 04:24:45 EST ---
REVIEW: http://review.gluster.org/16360 (posix, readdir-ahead: Handle genuine
ENOENT errors in readdirp) posted (#1) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-09 13:37:06 EST ---
COMMIT: http://review.gluster.org/16039 committed in master by Jeff Darcy
(jdarcy at redhat.com)
------
commit a988741713752c2ec04a0680224d8fa4d42dc203
Author: Poornima G <pgurusid at redhat.com>
Date: Tue Dec 6 14:43:10 2016 +0530
dht: At places needed use STACK_WIND_COOKIE
Issue:
frame has a void * cookie pointer.
In case of STACK_WIND_COOKIE frame->cookie is assigned
to what is sent by the caller.
In case of STACK_WIND frame->cookie is assigned to point
point to the frame itself.
For ease of coding, at many places, the cookie in the cbk
is used to get the pointer to the next xl. This is
inconsistent when STACK_WIND_TAIL comes into picture.
Eg: dht_setxattr () {
for (i = 0 ; i < conf->subvolume_cnt ; i++) {
STACK_WIND (..dht_checking_pathinfo_cbk,
conf->subvolumes[i] ..);
}
dht_checking_pathinfo_cbk (...void *cookie...) {
prev = cookie;
...
for (i = 0; i < conf->subvolume_cnt; i++) {
if (conf->subvolumes[i] == prev->this) {
...
}
}
}
Consider the below graph:
dht (parent)
readdir-ahead => Doesn't define setxattr and uses STACK_WIND_TAIL
protocol-client
With this graph, when dht_checking_pathinfo_cbk is called,
cookie will have frame pointing to protocol-client.
i.e. prev->this will be protocol-client. But dht was expecting
it to be readdir-ahead as it has stored in conf->subvolumes[i]
Solution:
Hence, as a thumb rule, if cbk is using cookie, then we explicitly
call STACK_WIND_COOKIE.
Change-Id: I83aea1e24c809c5a91a0db7283e908e125471bd4
BUG: 1401812
Signed-off-by: Poornima G <pgurusid at redhat.com>
Reviewed-on: http://review.gluster.org/16039
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
Smoke: Gluster Build System <jenkins at build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
--- Additional comment from Worker Ant on 2017-01-09 23:52:33 EST ---
COMMIT: http://review.gluster.org/16068 committed in master by Raghavendra G
(rgowdapp at redhat.com)
------
commit c89a065af2adc11d5aca3a4500d2e8c1ea02ed28
Author: Poornima G <pgurusid at redhat.com>
Date: Mon Jan 9 09:55:26 2017 +0530
readdir-ahead : Perform STACK_UNWIND outside of mutex locks
Currently STACK_UNWIND is performnd within ctx->lock.
If readdir-ahead is loaded as a child of dht, then there
can be scenarios where the function calling STACK_UNWIND
becomes re-entrant. Its a good practice to not call
STACK_WIND/UNWIND within local mutex's
Change-Id: If4e869849d99ce233014a8aad7c4d5eef8dc2e98
BUG: 1401812
Signed-off-by: Poornima G <pgurusid at redhat.com>
Reviewed-on: http://review.gluster.org/16068
Smoke: Gluster Build System <jenkins at build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Jeff Darcy <jdarcy at redhat.com>
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
--- Additional comment from Worker Ant on 2017-01-10 04:08:53 EST ---
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have
readdir-ahead as a child of dht) posted (#2) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-16 01:22:08 EST ---
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have
readdir-ahead as a child of dht) posted (#3) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-17 03:14:56 EST ---
REVIEW: http://review.gluster.org/16072 (glusterd: Change the volfile to have
readdir-ahead as a child of dht) posted (#4) for review on master by
Poornima G (pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-17 23:05:16 EST ---
COMMIT: http://review.gluster.org/16072 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com)
------
commit 2c03c753fe77dfadb7660ecb39fe0bbb6bad026f
Author: Poornima G <pgurusid at redhat.com>
Date: Thu Dec 8 16:48:55 2016 +0530
glusterd: Change the volfile to have readdir-ahead as a child
of dht
As mentioned in feature page http://review.gluster.org/#/c/16090/
readdir-ahead will be optionally placed below dht.
There are two options:
1. performance.readdir-ahead
2. performance.parallel-readdir
If only option is enabled, then readdir ahead is placed at its
original place as an ancestor of dht. If both the options 1 and 2
are enabled then readdir ahead is placed as a child of dht.
Also changes have been made to retain the rebalance, quotad,
snapd vol files to remain unchanged.
Change-Id: I0adf0b476fcbf91251f5a2fee2241786a3d8255a
BUG: 1401812
Signed-off-by: Poornima G <pgurusid at redhat.com>
Reviewed-on: http://review.gluster.org/16072
Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
Smoke: Gluster Build System <jenkins at build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
--- Additional comment from Worker Ant on 2017-01-17 23:29:58 EST ---
REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is
enabled, split the cache limit) posted (#1) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-17 23:51:02 EST ---
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise
option of dht) posted (#2) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-18 00:40:22 EST ---
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise
option of dht) posted (#3) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-18 01:13:40 EST ---
REVIEW: http://review.gluster.org/16071 (Readdir-ahead : Honor readdir-optimise
option of dht) posted (#4) for review on master by Poornima G
(pgurusid at redhat.com)
--- Additional comment from Worker Ant on 2017-01-19 02:40:45 EST ---
REVIEW: http://review.gluster.org/16424 (glusterd, rda: If parallel readdir is
enabled, split the cache limit) posted (#2) for review on master by Poornima G
(pgurusid at redhat.com)
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1401812
[Bug 1401812] RFE: Make readdirp parallel in dht
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
More information about the Bugs
mailing list