<div dir="ltr">To follow up on this, I've added an SSD-backed hot tier to my cluster, which dramatically improved performance. From observing iostat, it appears that all new files are created on the hot tier and migrated to the cold tier when the demotion daemon runs. Since new files land on the hot tier, their creation avoids the stat() calls on spinning disk, so new file creation is much faster, especially for small files.<div><br></div><div>-Tom</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 5, 2017 at 2:14 PM, Tom Fite <span dir="ltr"><<a href="mailto:tomfite@gmail.com" target="_blank">tomfite@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Hi all,<br></div><div><br></div><div>I have a distributed / replicated pool consisting of 2 boxes, with 3 bricks apiece. Each brick sits on a RAID 6 array of eleven 6 TB disks. I'm running CentOS 7 with XFS and LVM. The 150 TB pool is loaded with about 15 TB of data. Clients are connected via FUSE. I'm using glusterfs 3.12.1.</div><div><br></div><div>I've found that large rsyncs to populate the pool take a very long time, specifically with small files (write throughput is fine). I know, I know -- small files on gluster do not perform well, but I'm seeing particularly terrible performance, around 25 to 50 creates per second.</div><div><br></div><div>Profiling and testing indicate the main bottleneck is lstat calls on glusterfs metadata. Running an strace against the glusterd PIDs during a migration shows a lot of lstat calls taking a relatively long time to complete:</div><div><br></div><div>strace -Tfp 3544 -p 3550 -p 3536 2>&1 >/dev/null | awk '{gsub(/[<>]/,"",$NF)}$NF+.0>0.5' | grep -Ev "futex|epoll|select|nanosleep"</div><div>> 500 ms</div><div>[pid 3748] <... 
lstat resumed> 0x7f6db004f220) = -1 ENOENT (No such file or directory) 0.773194</div><div>[pid 29234] <... lstat resumed> 0x7f4c500ac220) = -1 ENOENT (No such file or directory) 1.010627</div><div>[pid 13083] <... lstat resumed> 0x7f1c3416c220) = -1 ENOENT (No such file or directory) 0.629203</div><div><br></div><div>These lstats can be traced back to calls that look similar to this:</div><div><br></div><div>[pid 31570] lstat("/data/brick1/gv0/.glusterfs/1a/61/1a616193-ddef-453b-a86d-dea73c7da496", 0x7f1778067220) = -1 ENOENT (No such file or directory) 0.102771</div><div>[pid 31568] lstat("/data/brick1/gv0/.glusterfs/7f/0b/7f0bf1d3-b3e9-4009-9692-4e2e55c6c822", 0x7f17780e9220) = -1 ENOENT (No such file or directory) 0.052719</div><div>[pid 31564] lstat("/data/brick2/gv0/.glusterfs/b0/49/b049a03c-114a-443c-bdfc-71ee981d8e84", 0x7f296c575220) = -1 ENOENT (No such file or directory) 0.195458</div><div><br></div><div>My theory is this: as the gluster pool fills with data, the .glusterfs metadata becomes scattered around the disk, causing random IO seek times to increase. Each file create triggers a read / seek in the glusterfs metadata folders for a nonexistent file, which takes a long time to look up due to the random nature of the directory hashes. If this is the root of the problem, it isn't a problem with gluster per se, but with LVM, XFS, the RAID configuration, or my drives.</div><div><br></div><div>This bug report might be the same issue: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1200457" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1200457</a></div><div><br></div><div>I wanted to check with the group to see if anybody else has run into this before and whether there are suggestions that might help. Specifically --</div><div><br></div><div>1. Would adding an SSD hot tier to my pool help here? 
Is the glusterfs metadata cached in the hot tier, or does hot tiering only cache frequently accessed files in the pool?</div><div><br></div><div>2. I have had some success with forcing the glusterfs dirents into the system cache by running a find in the .glusterfs directory to enumerate all dirents and warm the cache, eliminating seeks on disk. However, I'm at the mercy of the OS here, and as soon as the dirents are dropped things get slow again. Does anybody know of a way to keep the metadata in cache? I have 128 GB of RAM to work with, so I should be able to cache aggressively.</div><div><br></div><div>3. Are giant RAID 6 arrays just not going to perform well here? Would more bricks / smaller array sizes or a different RAID level help?</div><div><br></div><div>4. Would adding more servers to the gluster pool help or hurt?</div><div><br></div><div>Here's my glusterfs config. I've been trying every optimization tweak I can find, including md-cache, bumping up cache sizes, bumping event threads, etc...</div><div><br></div><div>Volume Name: gv0</div><div>Type: Distributed-Replicate</div><div>Volume ID: [ID]</div><div>Status: Started</div><div>Snapshot Count: 13</div><div>Number of Bricks: 3 x 2 = 6</div><div>Transport-type: tcp</div><div>Bricks:</div><div>Brick1: pod-sjc1-gluster1.exavault.com:/data/brick1/gv0</div><div>Brick2: pod-sjc1-gluster2.exavault.com:/data/brick1/gv0</div><div>Brick3: pod-sjc1-gluster1.exavault.com:/data/brick2/gv0</div><div>Brick4: pod-sjc1-gluster2.exavault.com:/data/brick2/gv0</div><div>Brick5: pod-sjc1-gluster1.exavault.com:/data/brick3/gv0</div><div>Brick6: pod-sjc1-gluster2.exavault.com:/data/brick3/gv0</div><div>Options Reconfigured:</div><div>performance.cache-refresh-timeout: 60</div><div>performance.stat-prefetch: on</div><div>server.outstanding-rpc-limit: 1024</div><div>cluster.lookup-optimize: on</div><div>performance.client-io-threads: on</div><div>nfs.disable: 
on</div><div>transport.address-family: inet</div><div>features.barrier: disable</div><div>client.event-threads: 16</div><div>server.event-threads: 16</div><div>performance.cache-size: 4GB</div><div>network.inode-lru-limit: 90000</div><div>performance.md-cache-timeout: 600</div><div>performance.cache-invalidation: on</div><div>features.cache-invalidation-timeout: 600</div><div>features.cache-invalidation: on</div><div>performance.quick-read: on</div><div>performance.io-cache: on</div><div>performance.nfs.write-behind-window-size: 512MB</div><div>performance.write-behind-window-size: 4MB</div><div>performance.nfs.io-threads: on</div><div>network.tcp-window-size: 1048576</div><div>performance.rda-cache-limit: 32MB</div><div>performance.flush-behind: on</div><div>server.allow-insecure: on</div><div>auto-delete: enable</div><div><br></div><div>Thanks</div><span class="HOEnZb"><font color="#888888"><div>-Tom</div><div><br></div><div><br></div><div><br></div></font></span></div>
</blockquote></div><br></div>
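<div>For anyone who wants to repeat the slow-lstat measurement from the quoted message without the awk one-liner, here is a minimal Python sketch of the same filter. It assumes the input is raw strace -T output, where each line ends with the syscall time as &lt;seconds&gt;. The function name slow_syscalls, the 0.5 second default, and the futex sample line are illustrative assumptions, not part of the original thread.</div>

```python
import re

# strace -T appends the time spent in each syscall as a trailing "<seconds>".
DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")

# Syscalls that routinely block for a long time and would drown out the signal
# (same list as the grep -Ev filter in the quoted message).
NOISY = ("futex", "epoll", "select", "nanosleep")


def slow_syscalls(lines, threshold=0.5):
    """Return (seconds, line) pairs for syscalls slower than `threshold` seconds.

    Hypothetical helper for illustration; it is not part of strace or gluster.
    """
    slow = []
    for line in lines:
        line = line.rstrip("\n")
        match = DURATION_RE.search(line)
        if match is None:
            continue  # no timing suffix, e.g. an "unfinished" line
        seconds = float(match.group(1))
        if seconds <= threshold:
            continue
        if any(name in line for name in NOISY):
            continue
        slow.append((seconds, line))
    return slow


# Two lines taken from the thread (with the "<...>" timing suffix restored)
# plus one made-up futex line to exercise the noise filter.
sample = [
    "[pid 3748] <... lstat resumed> 0x7f6db004f220) = -1 ENOENT "
    "(No such file or directory) <0.773194>",
    '[pid 31570] lstat("/data/brick1/gv0/.glusterfs/1a/61/'
    '1a616193-ddef-453b-a86d-dea73c7da496", 0x7f1778067220) = -1 ENOENT '
    "(No such file or directory) <0.102771>",
    "[pid 100] futex(0x7f00, FUTEX_WAIT, 0, NULL) = 0 <1.200000>",
]

for seconds, line in slow_syscalls(sample):
    print(f"{seconds:.6f}s  {line}")
```

<div>To run it against a live process, the lines could come from the stderr of strace -T -f -p PID instead of the sample list. Lowering the threshold to 0.05 would also catch the 50 to 200 ms lstat calls shown in the second trace.</div>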