[Bugs] [Bug 1692093] New: Network throughput usage increased x5
bugzilla at redhat.com
Sun Mar 24 09:07:39 UTC 2019
Bug ID: 1692093
Summary: Network throughput usage increased x5
Assignee: bugs at gluster.org
Reporter: pgurusid at redhat.com
CC: amukherj at redhat.com, bengoa at gmail.com,
bugs at gluster.org, info at netbulae.com,
jsecchiero at enter.eu, nbalacha at redhat.com,
pgurusid at redhat.com, revirii at googlemail.com,
rob.dewit at coosto.com
Depends On: 1673058
Target Milestone: ---
+++ This bug was initially created as a clone of Bug #1673058 +++
Description of problem:
Client network throughput in the OUT direction increased x5 after an upgrade of
the server from 3.11/3.12 to 5.3.
Now I have ~110 Mbps of traffic in the OUT direction for each client, and on
the server side a total of ~1450 Mbps for each gluster server.
See the attachment for graphs of network throughput before/after the upgrade.
Version-Release number of selected component (if applicable):
upgrade from 3.11, 3.12 to 5.3
Steps to Reproduce:
Upgrade a setup of 2 nodes with 1 volume with 2 distributed bricks per node.
Actual results:
Network throughput usage increased x5.
Expected results:
Just the features and the bugfixes of the 5.3 release.
Number of Peers: 1
State: Peer in Cluster (Connected)
Volume Name: storage_other
Volume ID: 6857bf2b-c97d-4505-896e-8fbc24bd16e8
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Status of volume: storage_other
Gluster process TCP Port RDMA Port Online Pid
Brick 10.2.0.181:/mnt/storage-brick1/data 49152 0 Y 1165
Brick 10.2.0.180:/mnt/storage-brick1/data 49152 0 Y 1149
Brick 10.2.0.181:/mnt/storage-brick2/data 49153 0 Y 1166
Brick 10.2.0.180:/mnt/storage-brick2/data 49153 0 Y 1156
Self-heal Daemon on localhost N/A N/A Y 1183
Self-heal Daemon on 10.2.0.180 N/A N/A Y 1166
Task Status of Volume storage_other
There are no active volume tasks
--- Additional comment from Nithya Balachandran on 2019-02-21 07:53:44 UTC ---
Is this high throughput consistent?
Please provide a tcpdump of the client process for about 30s to 1 min during
the high throughput to see what packets gluster is sending:
In a terminal on the client machine:
tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22
Wait for 30s-1min and stop the capture. Send us the pcap file.
Another user reported that turning off readdir-ahead worked for him. Please try
that after capturing the statedump and see if it helps you.
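For reference, the statedump and the readdir-ahead toggle mentioned above can be done with the standard gluster CLI (a sketch; `<volname>` is a placeholder for your volume name):

```shell
# Take a statedump of the volume first (written under /var/run/gluster by default)
gluster volume statedump <volname>

# Then disable readdir-ahead for the volume
gluster volume set <volname> performance.readdir-ahead off
```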
--- Additional comment from Alberto Bengoa on 2019-02-21 11:17:22 UTC ---
(In reply to Nithya Balachandran from comment #1)
> Is this high throughput consistent?
> Please provide a tcpdump of the client process for about 30s to 1 min during
> the high throughput to see what packets gluster is sending:
> In a terminal to the client machine:
> tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22
> Wait for 30s-1min and stop the capture. Send us the pcap file.
> Another user reported that turning off readdir-ahead worked for him. Please
> try that after capturing the statedump and see if it helps you.
I'm the other user, and I can confirm the same behaviour here.
In our tests we:
- Mounted the new cluster servers (running 5.3 version) using client 5.3
- Started a find . -type d on a directory with lots of directories.
- It generated an outgoing traffic (on the client) of around 90mbps (so,
inbound traffic on gluster server).
We repeated the same test using 3.8 client (on 5.3 cluster) and the outgoing
traffic on the client was just around 1.3 mbps.
I can provide pcaps if needed.
--- Additional comment from Nithya Balachandran on 2019-02-22 04:09:41 UTC ---
Assigning this to Amar to be reassigned appropriately.
--- Additional comment from Jacob on 2019-02-25 13:42:45 UTC ---
I'm not able to upload to the Bugzilla portal due to the size of the pcap.
You can download it from here:
--- Additional comment from Poornima G on 2019-03-04 15:23:14 UTC ---
Did disabling readdir-ahead fix the issue?
--- Additional comment from Hubert on 2019-03-04 15:32:17 UTC ---
We seem to have the same problem with a fresh install of glusterfs 5.3 on
Debian Stretch. We migrated from an existing setup (version 4.1.6,
distribute-replicate) to a new setup (version 5.3, replicate), and traffic on
the clients went up significantly, maybe causing massive iowait on the clients
during high-traffic times. Here are some munin graphs:
network traffic on high iowait client:
network traffic on old servers: https://abload.de/img/oldservers-eth1nejzt.jpg
network traffic on new servers: https://abload.de/img/newservers-eth17ojkf.jpg
performance.readdir-ahead is on by default. I could deactivate it tomorrow
morning (07:00 CEST), and provide tcpdump data if necessary.
--- Additional comment from Hubert on 2019-03-05 12:03:11 UTC ---
I set performance.readdir-ahead to off and watched network traffic for about 2
hours now, but traffic is still as high: 5-8 times higher than it was with the
old version.
Just curious: I see hundreds of thousands of these messages:
[2019-03-05 12:02:38.423299] W [dict.c:761:dict_ref]
[0x7f0dbb7e4a38] ) 5-dict: dict is NULL [Invalid argument]
see https://bugzilla.redhat.com/show_bug.cgi?id=1674225 - could this be
related?
--- Additional comment from Jacob on 2019-03-06 09:54:26 UTC ---
Disabling readdir-ahead doesn't change the throughput.
--- Additional comment from Alberto Bengoa on 2019-03-06 10:07:59 UTC ---
Nor for me.
BTW, shouldn't read-ahead/readdir-ahead generate traffic in the opposite
direction (server -> client)?
--- Additional comment from Nithya Balachandran on 2019-03-06 11:40:49 UTC ---
(In reply to Jacob from comment #4)
> i'm not able to upload in the bugzilla portal due to the size of the pcap.
> You can download from here:
The following are the calls and instance counts from the above capture:
104 proc-1 (stat)
8259 proc-11 (open)
46 proc-14 (statfs)
8239 proc-15 (flush)
8 proc-18 (getxattr)
68 proc-2 (readlink)
5576 proc-27 (lookup)
8388 proc-41 (forget)
Not sure if it helps.
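For anyone wanting to reproduce such per-FOP counts from a capture, a hedged one-liner (assuming a tshark/Wireshark build that includes the Gluster dissector; the `glusterfs.proc` field name may differ between versions):

```shell
# Tally GlusterFS calls per procedure number in the pcap
tshark -r dirls.pcap -Y glusterfs -T fields -e glusterfs.proc \
  | sort -n | uniq -c | sort -rn
```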
--- Additional comment from Hubert on 2019-03-07 08:34:21 UTC ---
I made a tcpdump as well:
tcpdump -i eth1 -s 0 -w /tmp/dirls.pcap tcp and not port 2222
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144
259699 packets captured
259800 packets received by filter
29 packets dropped by kernel
The file is 1.1G big; gzipped and uploaded it: https://ufile.io/5h6i2
Hope this helps.
--- Additional comment from Hubert on 2019-03-07 09:00:12 UTC ---
Maybe i should add that the relevant IP addresses of the gluster servers are:
192.168.0.50, 192.168.0.51, 192.168.0.52
--- Additional comment from Hubert on 2019-03-18 13:45:51 UTC ---
FYI: on a test setup (Debian Stretch, after upgrade 5.3 -> 5.5) I did a little
test:
- copied 11 GB of data
- via rsync: rsync --bwlimit=10000 --inplace (bandwidth limit of max. 10000)
- rsync pulled data over interface eth0
- rsync stats: sent 1,484,200 bytes, received 11,402,695,074 bytes
- so the external traffic average was about 5 MByte/s
- the result was internal traffic of up to 350 MBit/s (> 40 MByte/s) on eth1
(LAN interface)
- graphic of internal traffic:
- graphic of external traffic:
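As a rough sanity check on those numbers (a sketch using only the figures quoted above: ~5 MByte/s external average vs. up to 350 MBit/s internal):

```shell
# 350 MBit/s on eth1 is 350/8 = 43.75 MByte/s; compare with ~5 MByte/s on eth0
awk 'BEGIN { internal_mbyte = 350 / 8; printf "%.2f\n", internal_mbyte / 5 }'
# prints 8.75 -> internal LAN traffic is roughly 8-9x the external rsync stream
```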
--- Additional comment from Poornima G on 2019-03-19 06:15:50 UTC ---
Apologies for the delay. There have been some changes to the quick-read
feature, which reads the content of a file in the lookup fop if the file is
smaller than 64KB. I'm suspecting that with 5.3 the increase in bandwidth may
be due to a larger number of small-file reads (generated by the workload).
Please try the following:
gluster vol set <volname> quick-read off
gluster vol set <volname> read-ahead off
gluster vol set <volname> io-cache off
And let us know if the network bandwidth consumption decreases; meanwhile I
will try to reproduce the same locally.
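The current values of the three settings can be checked afterwards with `gluster volume get` (a sketch; `<volname>` is a placeholder for your volume name):

```shell
gluster volume get <volname> quick-read
gluster volume get <volname> read-ahead
gluster volume get <volname> io-cache
```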
--- Additional comment from Hubert on 2019-03-19 08:12:04 UTC ---
I deactivated the 3 params and did the same test again.
- same rsync params: rsync --bwlimit=10000 --inplace
- rsync stats: sent 1,491,733 bytes, received 11,444,330,300 bytes
- so ~6.7 MByte/s or ~54 MBit/s on average (peak of 60 MBit/s) over the
external interface
- traffic graphic of the server running the rsync command:
- so the server is sending with an average of ~110 MBit/s, with a peak at ~125
MBit/s, over the LAN interface
- traffic graphic of one of the replica servers (disregard the first curve:
that is the delete of the old data):
https://abload.de/img/if_enp5s0-internal-trn5k9v.png
- so one of the replicas receives data at ~55 MBit/s average, peak ~62 MBit/s
- as a comparison - traffic before and after changing the 3 params (rsync
server, highest curve is relevant):
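Those LAN figures are consistent with plain write replication (an inference, not stated in the report: each client write is shipped in full to two replica bricks, as in a replica 2 volume, or replica 3 with an arbiter that stores only metadata):

```shell
# ~54 MBit/s arriving via rsync, times 2 full copies sent to the replica bricks
echo $((54 * 2))   # 108, close to the observed ~110 MBit/s outbound on the LAN
```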
So it looks like the traffic was reduced to about a third. Is this what you
expected?
If so: traffic would still be a bit higher when I compare 4.1.6 and 5.3 -
here's a graphic of one client in our live system after switching from 4.1.6
(~20 MBit/s) to 5.3 (~100 MBit/s in March):
So if this traffic gets reduced to 1/3: traffic would be ~33 MBit/s then. Way
better, I think. And could be "normal"?
Thx so far :-)
--- Additional comment from Poornima G on 2019-03-19 09:23:48 UTC ---
Awesome, thank you for trying it out. I was able to reproduce this issue
locally; one of the major culprits was quick-read. The other two options had
no effect in reducing the bandwidth consumption. So for now, as a workaround,
you can disable quick-read:
# gluster vol set <volname> quick-read off
Quick-read alone reduced the bandwidth consumption by 70% for me. Debugging the
remaining 30% increase. Meanwhile, planning to make this bug a blocker for our
next release.
Will keep the bug updated with the progress.
--- Additional comment from Hubert on 2019-03-19 10:07:35 UTC ---
I'm running another test, just alongside... simply deleting and copying data,
no big effort. Just curious :-)
2 little questions:
- does disabling quick-read have any performance impact for certain workloads?
- is the bug only a blocker for the v6 release? Is an update for v5 planned?
--- Additional comment from Poornima G on 2019-03-19 10:36:20 UTC ---
(In reply to Hubert from comment #17)
> i'm running another test, just alongside... simply deleting and copying
> data, no big effort. Just curious :-)
I think if the volume hosts small files, then any kind of operation on these
files will see increased bandwidth usage.
> 2 little questions:
> - does disabling quick-read have any performance impact for certain workloads?
Small-file reads (files with size <= 64KB) will see reduced performance, e.g.
the web server use case.
> - is the bug only a blocker for the v6 release? Is an update for v5 planned?
Yes, there will be updates for v5, though I am not sure when. Updates for
major releases are made once every 3 or 4 weeks. For critical bugs the release
will be made earlier.
--- Additional comment from Alberto Bengoa on 2019-03-19 11:54:58 UTC ---
Thanks for your update Poornima.
I was already running with quick-read off here, so in my case I noticed the
traffic growing consistently after enabling it.
I've made some tests in my scenario, and I wasn't able to reproduce your 70%
reduction results. For me, it's nearer 46% traffic reduction (from around 103
Mbps to around 55 Mbps; graph attached here: https://pasteboard.co/I68s9qE.png)
What I'm doing is just running a find . -type d on a directory with loads of
directories.
Poornima, if you don't mind answering a question: why are we seeing this
traffic on the inbound side of the gluster servers (outbound of the clients)?
In my particular case, the traffic should be basically in the opposite
direction, I think, and I'm very curious about that.
--- Additional comment from Poornima G on 2019-03-22 17:42:54 UTC ---
Thank you all for the report. We have the RCA; working on the patch, will be
posting it shortly.
The issue was with the size of the payload being sent from the client to the
server for operations like lookup and readdirp. Hence workloads involving
lookup and readdirp would consume a lot of bandwidth.
[Bug 1673058] Network throughput usage increased x5