<div dir="ltr"><div>Hi Zenon,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 5, 2021 at 4:52 PM Zenon Panoussis &lt;<a href="mailto:oracle@provocation.net">oracle@provocation.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
Some time ago I created a replica 3 volume using gluster 8.3<br>
with the following topology for the time being:<br>
<br>
<pre>server1/brick1 ----\                          /---- server3/brick3
                    \____ ADSL 10/1 Mbits ___/
                    / &lt;- down          up -&gt; \
server2/brick2 ----/                          \---- old storage</pre>
<br>
<br>
The connection between the two boxes at each end is 1Gbit.<br>
The distance between the two sides is about 4000 km, with<br>
roughly 250 ms of latency.<br>
<br>
For the past one and a half months I have been running one rsync<br>
on each of the three servers to fetch different parts of a<br>
mail store from "old storage". The mail store consists of<br>
about 1.1 million predominantly small files very unevenly<br>
spread over 6600 directories. Some directories contain 30000+<br>
files, the worst one has 90000+.<br>
<br>
Copying simultaneously to all three servers wastes traffic<br>
(what is rsynced to server1 and server2 has to travel down<br>
from old storage and then back up again to server3), but<br>
uses the available bandwidth more efficiently (by using<br>
both directions instead of only down, as the case would be<br>
if I only rsynced to server3 and let the replication flow<br>
down to servers 1 and 2). I did this because, as I mentioned<br>
earlier in the thread "Replication logic", I cannot saturate<br>
any of CPU, disk I/O or even the meager network. This way<br>
the waste of traffic increases the overall speed of copying.<br>
Diagnostics showed that FSYNC had by far the greatest average<br>
latency, followed by MKDIR and CREATE, but they all had<br>
relatively few calls. LOOKUP is what has a huge number of<br>
calls so, even with a moderate average latency, it accounts<br>
for the greatest overall delay, followed by INODELK.<br>
<br>
I tested writing both to glusterfs and nfs-ganesha, but<br>
didn't notice any difference between them in speed (however,<br>
nfs-ganesha used seven times more memory than glusterfsd).<br>
Tweaking threads, write-behind, parallel-readdir, cache-size<br>
and inode-lru-limit didn't produce any noticeable difference<br>
either.<br>
<br>
Then a few days ago I noticed global-threading at<br>
<a href="https://github.com/gluster/glusterfs/issues/532" rel="noreferrer" target="_blank">https://github.com/gluster/glusterfs/issues/532</a> . It<br>
seemed promising but unmerged, yet it turned out that<br>
it has actually been merged. So last night I upgraded to 9.0<br>
and turned it on. I also dumped nfs-ganesha. With that,<br>
my configuration ended up like this:<br>
<br>
Volume Name: gv0<br>
Type: Replicate<br>
Volume ID: 2786efab-9178-4a9a-a525-21d6f1c94de9<br>
Status: Started<br>
Snapshot Count: 0<br>
Number of Bricks: 1 x 3 = 3<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: node1:/gfs/gv0<br>
Brick2: node2:/gfs/gv0<br>
Brick3: node3:/gfs/gv0<br>
Options Reconfigured:<br>
cluster.granular-entry-heal: enable<br>
network.ping-timeout: 20<br>
network.frame-timeout: 60<br>
performance.write-behind: on<br>
storage.fips-mode-rchecksum: on<br>
transport.address-family: inet<br>
nfs.disable: on<br>
performance.client-io-threads: off<br>
features.bitrot: off<br>
features.scrub: Inactive<br>
features.scrub-freq: weekly<br>
performance.io-thread-count: 32<br>
features.selinux: off<br>
client.event-threads: 3<br>
server.event-threads: 3<br>
cluster.min-free-disk: 1%<br>
features.cache-invalidation: on<br>
features.cache-invalidation-timeout: 600<br>
performance.cache-invalidation: on<br>
cluster.self-heal-daemon: enable<br>
diagnostics.latency-measurement: on<br>
diagnostics.count-fop-hits: on<br>
performance.cache-size: 256MB<br>
network.inode-lru-limit: 131072<br>
performance.parallel-readdir: on<br>
performance.qr-cache-timeout: 600<br>
performance.nl-cache-positive-entry: on<br>
performance.nfs.io-threads: on<br>
config.global-threading: on<br>
performance.iot-pass-through: on<br>
<br>
In the short time it's been running since, I saw no<br>
subjectively noticeable increase in the speed of<br>
writing, but I do see some increase in the speed of<br>
file listing (that is, the speed at which rsync<br>
without --whole-file will run through preexisting<br>
files while reporting "file X is uptodate"). This<br>
is presumably stat working faster because of thread<br>
parallelisation, but I'm only guessing. The network<br>
still does not get saturated except during the<br>
transfer of some occasional big (5MB+) files. So<br>
far I have seen no negative impact of turning global<br>
threading on compared to previously.<br>
<br>
Any and all ideas on how to improve this setup (other<br>
than physically) are most welcome.<br></blockquote><div><br></div><div>The main issue with global threading is that it's not regularly tested, so it could have unknown bugs. Besides that, are you using it on both the client and the bricks, or only on the client?</div><div><br></div><div>I think the main problem with rsync is that it's mostly a sequential program that issues many small requests. In that case it's hard to saturate the network, because the round-trip latency of each sequential operation dominates the total time.</div><div><br></div><div>To improve that, you could try running several rsync processes in parallel, which should make better use of the bandwidth. Gluster normally works better with parallel operations than with a single sequential stream.</div><div><br></div><div>Another thing you could try is increasing the kernel cache timeouts with the "entry-timeout" and "attribute-timeout" mount options. By default they are set to 1 second. A higher value could help reduce the number of lookups. However, it could also delay the detection of changes, or even cause inconsistencies in the worst case, so it should only be used when a single FUSE mount is using the volume. As with global threading, higher values here have not been tested, so there could be other unexpected problems.</div><div><br></div><div>Regards,</div><div><br></div><div>Xavi</div></div></div>
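<div><br></div><div>P.S. In case it helps, here is a minimal sketch of the "several rsync processes in parallel" idea, assuming the mail store can be split by top-level directory. The function names, the paths in the usage note and the worker count are all hypothetical and not taken from your setup; only the plain "rsync -a SRC/ DEST/" invocation is standard. It has not been tested against a Gluster volume.</div>
<pre>
```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_rsync_commands(src_root, dest_root, subdirs, extra_args=("-a",)):
    """Return one rsync argument list per subdirectory.

    Assumption: each subdirectory can be copied independently of the others.
    """
    commands = []
    for name in subdirs:
        src = f"{src_root.rstrip('/')}/{name}/"
        dest = f"{dest_root.rstrip('/')}/{name}/"
        commands.append(["rsync", *extra_args, src, dest])
    return commands


def run_parallel(commands, workers=4, runner=subprocess.call):
    """Run the commands concurrently and return their exit codes in input order.

    Threads are enough here because each worker only waits for an external
    rsync process to finish.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```
</pre>
<div>For example, build_rsync_commands("oldstorage:/mail", "/mnt/gv0/mail", ["dir1", "dir2"]) followed by run_parallel(cmds, workers=4) would keep up to four transfers in flight at once, so the round-trip latency of one transfer overlaps with the others. The destination parent directory is assumed to exist already, and a non-zero exit code in the returned list marks a directory that needs to be retried.</div>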