[Gluster-users] Slow seek times on stat calls to glusterfs metadata

Tue Dec 5 19:14:11 UTC 2017

Hi all,

I have a distributed-replicated pool consisting of 2 boxes with 3 bricks
apiece. Each brick sits on a RAID 6 array of eleven 6 TB disks. I'm
running CentOS 7 with XFS and LVM. The 150 TB pool is loaded with about
15 TB of data. Clients are connected via FUSE. I'm using glusterfs 3.12.1.

I've found that large rsyncs to populate the pool take a very long time,
specifically with small files (write throughput is fine). I know, I know --
small files on gluster do not perform well, but I'm seeing particularly
terrible performance, in the range of 25 to 50 creates per second.

Profiling and testing indicate the main bottleneck is lstat calls on
glusterfs metadata. Running strace against the brick process (glusterfsd)
PIDs during a migration shows a lot of lstat calls taking a relatively long
time to complete:

strace -Tfp 3544 -p 3550 -p 3536 2>&1 >/dev/null |
  awk '{gsub(/[<>]/,"",$NF)}$NF+.0>0.5' |
  grep -Ev "futex|epoll|select|nanosleep"

Calls taking > 500 ms:

[pid 3748] <... lstat resumed> 0x7f6db004f220) = -1 ENOENT (No such file or directory) 0.773194
[pid 29234] <... lstat resumed> 0x7f4c500ac220) = -1 ENOENT (No such file or directory) 1.010627
[pid 13083] <... lstat resumed> 0x7f1c3416c220) = -1 ENOENT (No such file or directory) 0.629203

These lstats can be traced back to calls that look similar to this:

[pid 31570] lstat("/data/brick1/gv0/.glusterfs/1a/61/1a616193-ddef-453b-a86d-dea73c7da496", 0x7f1778067220) = -1 ENOENT (No such file or directory) 0.102771
[pid 31568] lstat("/data/brick1/gv0/.glusterfs/7f/0b/7f0bf1d3-b3e9-4009-9692-4e2e55c6c822", 0x7f17780e9220) = -1 ENOENT (No such file or directory) 0.052719
[pid 31564] lstat("/data/brick2/gv0/.glusterfs/b0/49/b049a03c-114a-443c-bdfc-71ee981d8e84", 0x7f296c575220) = -1 ENOENT (No such file or directory) 0.195458
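
For anyone who wants to reproduce the filtering, the awk step in the strace pipeline can be wrapped in a small helper. This is only a sketch: the function name slow_calls and its threshold argument are made up for illustration, and it assumes strace was run with -T so that each line ends in an elapsed time such as <0.773194>.

```shell
#!/bin/sh
# slow_calls THRESHOLD
# Reads `strace -T` output on stdin and prints only the lines whose trailing
# <seconds> field is greater than THRESHOLD (in seconds).
# The helper name and argument are illustrative, not part of strace.
slow_calls() {
    awk -v limit="$1" '
        # strace -T appends the elapsed time as the last field, e.g. <0.773194>
        $NF ~ /^<[0-9.]+>$/ {
            t = $NF
            gsub(/[<>]/, "", t)          # strip the angle brackets from the copy
            if (t + 0 > limit + 0) print # print the original, unmodified line
        }
    '
}

# Example (PID taken from the command above):
#   strace -Tfp 3544 2>&1 >/dev/null | slow_calls 0.5
```

Unlike the one-liner, this version leaves the <...> timing field intact in the output, which makes the lines easier to feed into further tooling.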

My theory is this: as the gluster pool fills with data, the .glusterfs
metadata becomes scattered across the disk, so random I/O seek times
increase. Each file create triggers a lookup in the .glusterfs metadata
directories for a nonexistent file, and that lookup is slow because the
hashed directory names are effectively random. If this is the root of the
problem, it isn't a problem with gluster per se, but with LVM, XFS, the
RAID configuration, or my drives.
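
To make the "random directory" point concrete: judging from the strace output above, each lookup goes to .glusterfs/<first two hex characters>/<next two>/<full GFID>. The sketch below rebuilds that path from a GFID. The helper name gfid_path is hypothetical, and the layout is inferred only from the paths shown in this post.

```shell
#!/bin/sh
# gfid_path BRICK GFID
# Prints the .glusterfs path for a GFID, following the layout visible in the
# strace output: first two hex characters, next two, then the full GFID.
# The helper name and arguments are illustrative.
gfid_path() {
    brick=$1
    gfid=$2
    d1=$(printf '%s' "$gfid" | cut -c1-2)   # first-level directory
    d2=$(printf '%s' "$gfid" | cut -c3-4)   # second-level directory
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$d1" "$d2" "$gfid"
}

gfid_path /data/brick1/gv0 1a616193-ddef-453b-a86d-dea73c7da496
# prints /data/brick1/gv0/.glusterfs/1a/61/1a616193-ddef-453b-a86d-dea73c7da496
```

With two hex characters per level there are 256 x 256 = 65,536 possible leaf directories per brick, so if the GFIDs are effectively random, consecutive creates will rarely touch the same directory twice in a row.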

This bug report might be the same issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1200457

I wanted to check with the group to see if anybody else has run into this
before, and whether there are suggestions that might help. Specifically --

1. Would adding an SSD hot tier to my pool help here? Is the glusterfs
metadata cached in the hot tier or does hot tiering only cache frequently
accessed files in the pool?

2. I have had some success with forcing the glusterfs dirents into system
cache, by running a find in the .glusterfs directory to enumerate and warm
the cache with all dirents, eliminating seeks on disk. However, I'm at the
mercy of the OS here, and as soon as the dirents are dropped things get
slow again. Anybody know of a way to keep the metadata in cache? I have 128
GB of RAM to work with, so I should be able to aggressively cache.

3. Are giant RAID 6 arrays just not going to perform well here? Would more
bricks / smaller array sizes or a different RAID level help?

4. Would adding more servers to the gluster pool help or hurt?
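
For reference, the cache warming mentioned in question 2 amounts to something like the following. It is a rough sketch rather than the exact command I ran: the helper name warm_dirents is made up, and it only enumerates the tree, which is enough to pull the directory entries into cache but gives no guarantee about how long the kernel keeps them there.

```shell
#!/bin/sh
# warm_dirents DIR
# Walks DIR so the kernel reads every directory in the tree, then prints the
# number of entries visited (the count includes DIR itself).
# The helper name is illustrative.
warm_dirents() {
    find "$1" -print 2>/dev/null | wc -l
}

# Example (brick paths taken from the volume info in this post):
#   for b in /data/brick1/gv0 /data/brick2/gv0 /data/brick3/gv0; do
#       warm_dirents "$b/.glusterfs"
#   done
```

Comparing the wall-clock time of a first and second run (for example with `time`) gives a quick indication of whether the entries were still cached.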

Here's my glusterfs config. I've been trying every optimization tweak that
I can find, including md-cache, bumping up cache sizes, bumping event
threads, etc.

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: [ID]
Status: Started
Snapshot Count: 13
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: pod-sjc1-gluster1.exavault.com:/data/brick1/gv0
Brick2: pod-sjc1-gluster2.exavault.com:/data/brick1/gv0
Brick3: pod-sjc1-gluster1.exavault.com:/data/brick2/gv0
Brick4: pod-sjc1-gluster2.exavault.com:/data/brick2/gv0
Brick5: pod-sjc1-gluster1.exavault.com:/data/brick3/gv0
Brick6: pod-sjc1-gluster2.exavault.com:/data/brick3/gv0
Options Reconfigured:
performance.cache-refresh-timeout: 60
performance.stat-prefetch: on
server.outstanding-rpc-limit: 1024
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
features.barrier: disable
client.event-threads: 16
server.event-threads: 16
performance.cache-size: 4GB
network.inode-lru-limit: 90000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.quick-read: on
performance.io-cache: on
performance.nfs.write-behind-window-size: 512MB
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
network.tcp-window-size: 1048576
performance.rda-cache-limit: 32MB
performance.flush-behind: on
server.allow-insecure: on
auto-delete: enable

Thanks
-Tom