[Bugs] [Bug 1478411] Directory listings on fuse mount are very slow due to small number of getdents () entries

bugzilla at redhat.com bugzilla at redhat.com
Sun Sep 17 01:57:39 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1478411



--- Comment #11 from nh2 <nh2-redhatbugzilla at deditus.de> ---
So for SSHFS we got an improvement in getdents() performance from ~113 MB/s to
~854 MB/s.

Now I'm running a similar test on gluster, with only 10K files for the sake of time.


My gluster config:

  Volume Name: myvol
  Type: Replicate
  Volume ID: ...
  Status: Started
  Snapshot Count: 0
  Number of Bricks: 1 x 3 = 3
  Transport-type: tcp
  Bricks:
  Brick1: 10.0.0.1:/data/brick
  Brick2: 10.0.0.2:/data/brick
  Brick3: 10.0.0.3:/data/brick
  Options Reconfigured:
  nfs.disable: on
  transport.address-family: inet
  client.ssl: on
  server.ssl: on
  storage.linux-aio: on
  performance.io-thread-count: 64
  performance.readdir-ahead: on
  server.event-threads: 32
  client.event-threads: 32
  server.outstanding-rpc-limit: 64
  cluster.lookup-unhashed: auto
  performance.flush-behind: on
  performance.strict-write-ordering: off
  performance.high-prio-threads: 64
  performance.normal-prio-threads: 64
  performance.low-prio-threads: 64
  performance.write-behind-window-size: 10MB
  cluster.ensure-durability: on
  performance.lazy-open: yes
  cluster.use-compound-fops: off
  performance.open-behind: on
  features.cache-invalidation: off
  performance.quick-read: off
  performance.read-ahead: off
  performance.stat-prefetch: off
  changelog.rollover-time: 1
  geo-replication.indexing: on
  geo-replication.ignore-pid-check: on
  changelog.changelog: on

Files created like

  # touch /glustermount/largenames-10k/1234567890123456789012345678901234567890-file{1..10000}

These numbers were obtained from quickly repeated runs.
All of these benchmarks were taken on an AWS t2.small medium machine (these
machines are generally bad for benchmarking, but I've checked that the results
are consistent; importantly, the numbers are relative to each other).
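
(./listdir-silent itself is not included in this comment; for reference, a
minimal sketch of such a tool could look like the following, assuming it simply
loops over getdents() with a 128 KiB buffer and discards the entries:)

  /* Minimal sketch of a "listdir-silent" style tool (assumption: it only
   * measures getdents() cost and throws the entries away). */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  int main(int argc, char **argv)
  {
      static char buf[131072];   /* 128 KiB, like the buffer size in the straces below */
      if (argc < 2) { fprintf(stderr, "usage: %s <dir>\n", argv[0]); return 1; }

      int fd = open(argv[1], O_RDONLY | O_DIRECTORY);
      if (fd < 0) { perror("open"); return 1; }

      for (;;) {
          /* glibc has no getdents() wrapper, so go through syscall(2);
           * strace shows these as the getdents() calls below. */
          long n = syscall(SYS_getdents, fd, buf, sizeof(buf));
          if (n <= 0) break;     /* 0 = end of directory, -1 = error */
          /* entries intentionally discarded ("silent") */
      }
      close(fd);
      return 0;
  }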

Without kernel patch (1 page):

  # strace -w -c -f -e getdents ./listdir-silent /glustermount/largenames-10k/
  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
  100.00    0.213868         384       557           getdents

With 32-page kernel patch:

  # strace -w -c -f -e getdents ./listdir-silent /glustermount/largenames-10k
  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
  100.00    0.211732       11763        18           getdents

Almost no improvement!
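
For reference, the call counts line up with the entry counts:

  10,000 files / ~18 entries per call  =~ 557 calls   (1 page)
  10,000 files / ~600 entries per call =~ 17-18 calls (32 pages)

So the patch does cut the number of getdents() calls by ~32x; the total time
just doesn't improve, because each of the larger calls takes correspondingly
longer (see below).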

Let's look at the individual getdents() invocations:

Without kernel patch (1 page):

  # strace -Tttt -f -e getdents ./listdir-silent /glustermount/largenames-10k/
  1505608150.612771 getdents(3, /* 19 entries */, 131072) = 1272 <0.007789>
  -- notice large initial time above --
  1505608150.620606 getdents(3, /* 18 entries */, 131072) = 1296 <0.000117>
  1505608150.620738 getdents(3, /* 18 entries */, 131072) = 1296 <0.000090>
  1505608150.620842 getdents(3, /* 18 entries */, 131072) = 1296 <0.000084>
  -- 30 similar lines omitted --
  1505608150.623666 getdents(3, /* 18 entries */, 131072) = 1296 <0.010920>
  -- notice large time above --
  1505608150.634608 getdents(3, /* 18 entries */, 131072) = 1296 <0.000079>
  1505608150.634701 getdents(3, /* 18 entries */, 131072) = 1296 <0.000083>

With 32-page kernel patch:

  # strace -Tttt -f -e getdents ./listdir-silent /glustermount/largenames-10k/
  1505608076.391872 getdents(3, /* 604 entries */, 131072) = 43392 <0.022552>
  1505608076.414921 getdents(3, /* 602 entries */, 131072) = 43344 <0.001411>
  1505608076.416477 getdents(3, /* 601 entries */, 131072) = 43272 <0.025793>
  1505608076.442688 getdents(3, /* 599 entries */, 131072) = 43128 <0.000730>
  1505608076.443521 getdents(3, /* 601 entries */, 131072) = 43272 <0.024431>
  1505608076.468360 getdents(3, /* 599 entries */, 131072) = 43128 <0.001108>
  -- 12 similar lines omitted --

Observations:

1) Every second getdents() in the 32-page case is super fast. I suspect that
this is the effect of `readdir-ahead: on` (wasn't that also 128 KB? I read that
somewhere). The same goes for the 1-page case, where only every ~30th getdents()
hits the network to glusterfsd.
2) Even in the 32-page case, the buffer provided to getdents() (131072 bytes)
is not filled entirely; only approximately 1/3 is used. The same goes for the
1-page case: about 1/3 of the 4K FUSE buffer is filled. When I first saw this I
attributed it to the "fuse_direntplus vs linux_dirent" topic that Poornima
mentioned, but SSHFS doesn't seem to have this problem and always fills its
buffers, so maybe gluster is doing something strange here? (See the rough size
estimate after this list.)
3) The performance is a miserable ~1.7 MB/s in both cases, much worse than
SSHFS.
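
Regarding observation 2, a rough back-of-the-envelope estimate (this assumes
the kernel asks gluster for readdirplus entries, i.e. struct fuse_direntplus,
and converts them into the much smaller struct linux_dirent64 records that
getdents() returns; struct sizes taken from <linux/fuse.h> and the getdents
ABI):

  struct linux_dirent64:  19-byte header + name, padded to 8 bytes
                          -> 72 bytes for these ~50-char names
                             (matches the 1296 bytes / 18 entries = 72 above)
  struct fuse_direntplus: 128-byte fuse_entry_out + 24-byte fuse_dirent + padded name
                          -> ~208 bytes per entry

  208 / 72 =~ 2.9

so a readdirplus reply that fills the kernel-side FUSE buffer would shrink to
roughly 1/3 of that size once converted to linux_dirent64 entries, which is
about the fill ratio seen above. Whether that is really what happens here (and
why SSHFS behaves differently) I haven't verified.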

So I started investigating what it (the glusterfs FUSE mount process) is doing.

First thing I noticed: I was wrong when I said

> and as you say, no networking is involved

after doing my test with `tc`: I had only tested that no inter-brick, or FUSE
to other-brick networking was happening.

But what is certainly happening is localhost networking!

If I use

  # tc qdisc add dev lo root netem delay 200ms

thus making my `ping localhost` show 400 ms, then all getdents() calls get
significantly slower:

  # strace -Tttt -f -e getdents ./listdir-silent /mount/glustermount-10k
  1505580070.060286 getdents(3, /* 604 entries */, 131072) = 43392 <0.824180>
  1505580070.885190 getdents(3, /* 602 entries */, 131072) = 43344 <0.000909>
  1505580070.886592 getdents(3, /* 601 entries */, 131072) = 43272 <0.824842>
  1505580071.711986 getdents(3, /* 599 entries */, 131072) = 43128 <0.000693>
  1505580071.713164 getdents(3, /* 601 entries */, 131072) = 43272 <0.826009>
  1505580072.539722 getdents(3, /* 599 entries */, 131072) = 43128 <0.000702>

So there is networking happening, in my case it just happened over the `lo`
interface instead of over `eth0` because the machine on which I have the mount
is also one of the 3-replica bricks!
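
(The artificial delay can be removed again afterwards with:

  # tc qdisc del dev lo root netem

)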

A quick `ping localhost` shows that localhost latency is ~50 us; I find that to
be a lot (but apparently that's just how it is), and it's probably not very
relevant here (at least the following strace doesn't indicate so at the
userspace level).

So I ran strace against the `glusterfs` FUSE process:

  # strace -tttT -f -p THEPID

What I could see was a FUSE request/response loop:
There is a readv() against /dev/fuse started, with which the process asks fuse
for the next operation (e.g. a READDIR opcode); this readv() blocks as long as
the FUSE kernel side is still waiting on us, the userspace process, to send it
the result of the currently running operation. Once we send it (e.g. the
contents of a readdir response) with a writev(), the readv() returns and gives
us the next opcode to work on.

As a result, the fulfillment of each getdents() request is bracketed between a
`<... readv resumed>` and a final `writev(8, ...)` syscall.
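
A minimal sketch of that loop (hypothetical code; the real implementation lives
in glusterfs' fuse-bridge and uses its own buffers and threads -- handle_request()
here is just a placeholder):

  #include <sys/types.h>
  #include <sys/uio.h>
  #include <stddef.h>
  #include <unistd.h>

  void handle_request(int fuse_fd, char *buf, size_t len);  /* placeholder */

  void fuse_loop(int fuse_fd, char *buf, size_t bufsize)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = bufsize };

      for (;;) {
          /* Blocks until the kernel hands us the next request (e.g. a READDIR
           * opcode); while we still owe the reply for the previous request,
           * the application's getdents() stays blocked on that reply. */
          ssize_t n = readv(fuse_fd, &iov, 1);
          if (n <= 0)
              break;

          /* Decode the opcode, fetch the data (for READDIR: ask glusterfsd
           * over its socket), then answer with writev(fuse_fd, ...) -- only
           * that writev() lets the blocked getdents() return. */
          handle_request(fuse_fd, buf, (size_t)n);
      }
  }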

I have posted one of those brackets at

  https://gist.github.com/nh2/163ffea5bdc16b3a509c4b262b1d382a

Each such getdents() fulfillment takes ~29 ms. I find that quite a lot just to
fill in the ~40000 bytes shown above, so I analysed what the major time sinks
within these 29 ms are in this strace output of the `glusterfs` FUSE mount
process. You can find that analysis here:

 
https://gist.github.com/nh2/163ffea5bdc16b3a509c4b262b1d382a#gistcomment-2205164

Quite a lot of CPU time is spent (again, seemingly way too much for providing
40000 bytes), but significant time is also spent waiting on communication with
the glusterfsd socket (which is FD 10), with poll() and apparently a blocking
write().

(There are also some precisely sized reads, such as those 5-byte reads; I wonder
whether they could be combined into larger reads to reduce the number of
syscalls, but they are so fast compared to the rest that's going on that they
don't matter for the current investigation.)

As there's lots of waiting for glusterfsd in there, I started to strace
glusterfsd instead, while running `./listdir-silent` against the fuse mount.

That immediately revealed why getdents() on glusterfs is so much slower than on
SSHFS (sorry for the long lines):

  [pid   972] 1505589830.074176
lstat("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000007>
  [pid   972] 1505589830.074219
lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
"tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000008>
  [pid   972] 1505589830.074262
lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
"tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
  [pid   972] 1505589830.074302
lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
"tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000006>
  [pid   972] 1505589830.074343
lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
"sy"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
  [pid   972] 1505589830.074383
lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
"sy"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
  [pid   972] 1505589830.074423
lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552",
"tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
  [pid   972] 1505589830.074464
lstat("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7553",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000006>
  -- same lgetxattr() calls as above repeated --

Apparently when gluster tries to fulfil a getdents() FUSE request, each
individual file returned by my XFS brick file system is stat()ed afterwards and
also gets 6 lgetxattr() calls (all returning ENODATA).
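
That means that for this one directory listing, the brick side performs roughly

  10,000 files x (1 lstat() + 6 lgetxattr()) = 70,000 syscalls

on the brick filesystem, before any FUSE work or network round-trips are even
counted.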

I guess that's just part of how gluster works (e.g. to determine based on
xattrs whether a file shall actually be shown in the output or not), but it
wasn't obvious to me when I started debugging this, and it certainly makes the
operation much slower than just forwarding some underlying XFS getdents()
results.

As mentioned in my Github gist from before, in contrast to stracing
`glusterfs`, when stracing `glusterfsd` the strace overhead actually becomes
very noticeable (the ./listdir-silent run becomes 5x slower when strace is
attached), and the above -ttt times are not accurate because they incorporate
this strace overhead (you can see that the <...> times emitted by -T, roughly
6 us each, don't add up to the ~40 us steps between the -ttt timestamps).

So I switched to measure with `perf` instead, using e.g.

  # perf record -e 'syscalls:sys_*' -p 918

which had 6x overhead for an entire ./listdir-silent run over the 10k files,
and then, for lower overhead, recording only specific syscalls (by that point I
already knew which syscalls were going on, so I couldn't miss any in between),

  # perf record -e 'syscalls:sys_enter_newlstat' -e 'syscalls:sys_exit_newlstat' \
                -e 'syscalls:sys_enter_lgetxattr' -e 'syscalls:sys_exit_lgetxattr' -p 918
  # perf script

which shows much less overhead in the profile:

  26239.494406:  syscalls:sys_enter_newlstat
  26239.494408:   syscalls:sys_exit_newlstat
  26239.494409: syscalls:sys_enter_lgetxattr
  26239.494411:  syscalls:sys_exit_lgetxattr
  26239.494412: syscalls:sys_enter_lgetxattr
  26239.494413:  syscalls:sys_exit_lgetxattr
  26239.494414: syscalls:sys_enter_lgetxattr
  26239.494416:  syscalls:sys_exit_lgetxattr
  26239.494417: syscalls:sys_enter_lgetxattr
  26239.494418:  syscalls:sys_exit_lgetxattr
  26239.494419: syscalls:sys_enter_lgetxattr
  26239.494420:  syscalls:sys_exit_lgetxattr
  26239.494421: syscalls:sys_enter_lgetxattr
  26239.494423:  syscalls:sys_exit_lgetxattr
  26239.494424:  syscalls:sys_enter_newlstat
  26239.494426:   syscalls:sys_exit_newlstat

and also increased total run time only by ~2x to ~0.5 seconds.

I suspect the profile shows that the number of syscalls gluster makes on each
file (and the fact that it has to make per-file syscalls at all to serve
getdents()) is somewhat problematic: each file eats ~18 us, which is roughly
50K files per second, or 100K per second assuming 2x profiling overhead.
Even if there were no networking involved at all, just stat()ing & getfattr()ing
the 10K files would take 0.1 seconds, which is half of the 0.2 seconds that the
entire ./listdir-silent takes.

I wonder whether some future version of Gluster should stop storing all its
metadata in extended attributes, as putting it there requires lots of syscalls
to retrieve it. I guess this is one reason for Gluster's poor small-file and
directory performance?

I suspect that if it had all this info in memory, or could obtain it via a
few-system-calls-large-reads method (e.g. a single index file or even mmap), it
could do better at this.
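
Purely as an illustration of the "few-system-calls-large-reads" idea (this is
not how gluster works today, and the index-file layout below is completely made
up):

  #include <stdint.h>
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Hypothetical packed per-entry metadata record, replacing what today
   * needs 1 lstat() + 6 lgetxattr() per file. */
  struct entry_meta {
      uint64_t ino;
      uint32_t mode;
      uint32_t flags;          /* what the xattrs would encode, packed */
      uint8_t  gfid[16];
      char     name[224];      /* fixed size for simplicity */
  };                           /* 256 bytes per entry */

  int main(int argc, char **argv)
  {
      if (argc < 2) return 1;
      /* e.g. a per-directory ".meta-index" file (hypothetical) */
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct entry_meta recs[512];
      off_t off = 0;
      ssize_t n;
      /* One pread() fetches metadata for 512 directory entries at once,
       * instead of 512 * 7 = 3,584 per-file syscalls. */
      while ((n = pread(fd, recs, sizeof(recs), off)) > 0)
          off += n;

      close(fd);
      return 0;
  }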

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

