[Gluster-users] Slow seek times on stat calls to glusterfs metadata

Fri Jan 12 17:45:26 UTC 2018

To follow up on this: I've added an SSD-backed hot tier to my cluster, and
it dramatically improved performance. Judging from iostat, it appears that
all new files are created on the hot tier and migrated to the cold tier
when the demotion daemon runs. Since new files land on the hot tier, the
stat() calls no longer hit spinning disk, and throughput for new file
creation is much higher, especially for small files.
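
For anyone wanting to try the same thing, here is a rough sketch of how a
replicated hot tier is attached with the tiering CLI of the 3.x releases.
The SSD brick paths (/data/ssd1/gv0) are hypothetical placeholders, and the
helper only prints the command so it can be reviewed before being run by
hand.

```shell
# tier_attach_cmd VOLUME BRICK...: print (do not run) the command that would
# attach the given bricks to VOLUME as a replica-2 hot tier.
tier_attach_cmd() {
    volume=$1
    shift
    echo "gluster volume tier $volume attach replica 2 $*"
}

# Hypothetical SSD bricks, one per server; substitute real paths.
tier_attach_cmd gv0 \
    pod-sjc1-gluster1.exavault.com:/data/ssd1/gv0 \
    pod-sjc1-gluster2.exavault.com:/data/ssd1/gv0
```

Once a tier is attached, "gluster volume tier gv0 status" should report
promotion and demotion activity.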

-Tom

On Tue, Dec 5, 2017 at 2:14 PM, Tom Fite <tomfite at gmail.com> wrote:

> Hi all,
>
> I have a distributed / replicated pool consisting of 2 boxes, with 3
> bricks apiece. Each brick sits on a RAID 6 array of eleven 6 TB disks.
> I'm running CentOS 7 with XFS and LVM. The 150 TB pool is loaded with
> about 15 TB of data. Clients are connected via FUSE. I'm using
> glusterfs 3.12.1.
>
> I've found that large rsyncs to populate the pool take a very long time,
> specifically with small files (write throughput is fine). I know, I
> know: small files on gluster do not perform well, but I'm seeing
> particularly terrible performance, in the range of 25 to 50 creates
> per second.
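
A creates-per-second figure like the one above can be approximated with a
small loop. This is only a sketch: the function name create_rate is
hypothetical, it creates empty files rather than rsync-style small files,
and it has whole-second timer resolution, so it is only meaningful for
runs lasting several seconds.

```shell
# create_rate DIR COUNT: create COUNT empty files in DIR and print the
# approximate number of creates per second.
create_rate() {
    dir=$1
    count=$2
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        : > "$dir/smallfile.$i"
        i=$((i + 1))
    done
    elapsed=$(( $(date +%s) - start ))
    # Whole-second resolution: count a sub-second run as one second.
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo $((count / elapsed))
}
```

Pointed at a scratch directory on the FUSE mount, for example
"create_rate /mnt/gv0/scratch 1000" (the mount path is an assumption), it
gives a number comparable to the 25 to 50 creates per second quoted above.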
>
> Profiling and testing indicate the main bottleneck is lstat calls on
> glusterfs metadata. Running an strace against the glusterd PIDs during a
> migration shows a lot of lstat calls taking a relatively long time to
> complete:
>
> strace -Tfp 3544 -p 3550 -p 3536 2>&1 >/dev/null \
>   | awk '{gsub(/[<>]/,"",$NF)}$NF+.0>0.5' \
>   | grep -Ev "futex|epoll|select|nanosleep"
> Calls taking longer than 500 ms:
> [pid 3748] <... lstat resumed> 0x7f6db004f220) = -1 ENOENT (No such file
> or directory) 0.773194
> [pid 29234] <... lstat resumed> 0x7f4c500ac220) = -1 ENOENT (No such file
> or directory) 1.010627
> [pid 13083] <... lstat resumed> 0x7f1c3416c220) = -1 ENOENT (No such file
> or directory) 0.629203
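
The filter in that pipeline can be wrapped as a reusable function. This is
a sketch under a few assumptions: the name slow_calls is hypothetical, the
input is "strace -T" output with the elapsed time as a trailing <seconds>
field, and unlike the one-liner above it leaves the angle brackets in
place.

```shell
# slow_calls THRESHOLD: read `strace -T` output on stdin and print only the
# calls whose elapsed time exceeds THRESHOLD seconds, skipping the
# wait-style syscalls that are expected to block.
slow_calls() {
    awk -v limit="$1" '
        # strace -T appends the elapsed time as <seconds> in the last field
        $NF ~ /^<[0-9.]+>$/ {
            t = $NF
            gsub(/[<>]/, "", t)
            if (t + 0 > limit + 0) print
        }' | grep -Ev 'futex|epoll|select|nanosleep'
}
```

Used against one of the PIDs above, that would look like
"strace -Tfp 3544 2>&1 >/dev/null | slow_calls 0.5".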
>
> These lstats can be traced back to calls that look similar to this:
>
> [pid 31570] lstat("/data/brick1/gv0/.glusterfs/1a/61/1a616193-ddef-453b-a86d-dea73c7da496",
> 0x7f1778067220) = -1 ENOENT (No such file or directory) 0.102771
> [pid 31568] lstat("/data/brick1/gv0/.glusterfs/7f/0b/7f0bf1d3-b3e9-4009-9692-4e2e55c6c822",
> 0x7f17780e9220) = -1 ENOENT (No such file or directory) 0.052719
> [pid 31564] lstat("/data/brick2/gv0/.glusterfs/b0/49/b049a03c-114a-443c-bdfc-71ee981d8e84",
> 0x7f296c575220) = -1 ENOENT (No such file or directory) 0.195458
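
The paths in those calls follow a fixed layout: the first two hex
characters of the GFID name the first directory level under .glusterfs,
and the next two name the second. A minimal sketch of that mapping (the
helper name gfid_path is hypothetical):

```shell
# gfid_path BRICK GFID: print the .glusterfs path that corresponds to GFID
# on the brick rooted at BRICK.
gfid_path() {
    brick=$1
    gfid=$2
    first=$(printf '%s' "$gfid" | cut -c1-2)    # e.g. "1a"
    second=$(printf '%s' "$gfid" | cut -c3-4)   # e.g. "61"
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$first" "$second" "$gfid"
}

gfid_path /data/brick1/gv0 1a616193-ddef-453b-a86d-dea73c7da496
```

Because GFIDs are effectively random, consecutive creates land in
unrelated directories, which fits the seek pattern seen in the trace.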
>
> My theory is this: as the gluster pool fills with data, the .glusterfs
> metadata is scattered around the disk, causing random IO seek times to
> increase. Each file create triggers a lookup in the .glusterfs metadata
> folders for a nonexistent file, which takes a long time because the
> directory hashes are effectively random. If this is the root of the
> problem, it isn't a problem with gluster per se, but with LVM, XFS, the
> RAID configuration, or my drives.
>
> This bug report might be the same issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1200457
>
> I wanted to check with the group to see if anybody else has run into this
> before, and if there are suggestions that might help. Specifically:
>
> 1. Would adding an SSD hot tier to my pool help here? Is the glusterfs
> metadata cached in the hot tier or does hot tiering only cache frequently
> accessed files in the pool?
>
> 2. I have had some success with forcing the glusterfs dirents into system
> cache, by running a find in the .glusterfs directory to enumerate and warm
> the cache with all dirents, eliminating seeks on disk. However, I'm at the
> mercy of the OS here, and as soon as the dirents are dropped things get
> slow again. Anybody know of a way to keep the metadata in cache? I have 128
> GB of RAM to work with, so I should be able to aggressively cache.
>
> 3. Are giant RAID 6 arrays just not going to perform well here? Would more
> bricks / smaller array sizes or a different RAID level help?
>
> 4. Would adding more servers to gluster pool help or hurt?
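
The cache warming described in item 2 can be sketched as below. The
function name warm_dirents is hypothetical. The commented-out sysctl is a
general Linux knob (lower values make the kernel prefer to keep dentry and
inode caches); whether it helps on this workload is untested, and the
value 50 is purely illustrative.

```shell
# warm_dirents DIR: walk DIR so that its directory entries are read into
# cache, and print how many entries were visited.
warm_dirents() {
    find "$1" -xdev 2>/dev/null | wc -l | tr -d ' '
}

# Optional and untested here: ask the kernel to hold on to dentry and inode
# caches longer. 50 is an illustrative value, not a recommendation.
# sysctl -w vm.vfs_cache_pressure=50
```

Run once per brick, for example
"warm_dirents /data/brick1/gv0/.glusterfs", and repeat whenever the
entries appear to have been evicted.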
>
> Here's my glusterfs config. I've been trying every optimization tweak
> that I can find, including md-cache, bumping up cache sizes, bumping
> event threads, etc.
>
> Volume Name: gv0
> Type: Distributed-Replicate
> Volume ID: [ID]
> Status: Started
> Snapshot Count: 13
> Number of Bricks: 3 x 2 = 6
> Transport-type: tcp
> Bricks:
> Brick1: pod-sjc1-gluster1.exavault.com:/data/brick1/gv0
> Brick2: pod-sjc1-gluster2.exavault.com:/data/brick1/gv0
> Brick3: pod-sjc1-gluster1.exavault.com:/data/brick2/gv0
> Brick4: pod-sjc1-gluster2.exavault.com:/data/brick2/gv0
> Brick5: pod-sjc1-gluster1.exavault.com:/data/brick3/gv0
> Brick6: pod-sjc1-gluster2.exavault.com:/data/brick3/gv0
> Options Reconfigured:
> performance.cache-refresh-timeout: 60
> performance.stat-prefetch: on
> server.outstanding-rpc-limit: 1024
> cluster.lookup-optimize: on
> performance.client-io-threads: on
> nfs.disable: on
> transport.address-family: inet
> features.barrier: disable
> client.event-threads: 16
> server.event-threads: 16
> performance.cache-size: 4GB
> network.inode-lru-limit: 90000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> performance.quick-read: on
> performance.io-cache: on
> performance.nfs.write-behind-window-size: 512MB
> performance.write-behind-window-size: 4MB
> performance.nfs.io-threads: on
> network.tcp-window-size: 1048576
> performance.rda-cache-limit: 32MB
> performance.flush-behind: on
> server.allow-insecure: on
> auto-delete: enable
>
> Thanks
> -Tom