[Bugs] [Bug 1259572] New: client is sending io to arbiter with replica 2
bugzilla at redhat.com
Thu Sep 3 05:33:12 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1259572
Bug ID: 1259572
Summary: client is sending io to arbiter with replica 2
Product: GlusterFS
Version: mainline
Component: replicate
Keywords: Triaged
Severity: urgent
Assignee: bugs at gluster.org
Reporter: ravishankar at redhat.com
CC: bugs at gluster.org, gluster-bugs at redhat.com,
ravishankar at redhat.com, sdainard at spd1.com,
ueberall at projektzentrisch.de
Depends On: 1255110
+++ This bug was initially created as a clone of Bug #1255110 +++
Description of problem:
Using a replica 2 + arbiter 1 configuration for an oVirt storage domain. When
the arbiter node is up, client IO throughput drops by ~30%. Monitoring bandwidth
on the arbiter node shows significant rx traffic, in this case 50-60 MB/s, yet
the brick's local path on the arbiter node shows no significant disk space usage
(< 1 MB).
Version-Release number of selected component (if applicable):
CentOS 6.7 / 7.1
Gluster 3.7.3
How reproducible:
Always. If the arbiter node is killed, client IO throughput is higher.
Initially discovered using oVirt, but also easily reproduced by writing to the
client fuse mount point.
Steps to Reproduce:
1. Create a gluster volume with replica 2 arbiter 1
2. Write data to the client fuse mount point
3. Watch realtime network bandwidth on the arbiter node (see the command
sketch below)
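For reference, a minimal command sequence for these steps might look like the
sketch below. Hostnames, brick paths, the volume name and the interface are
placeholders, and the create syntax shown ("replica 3 arbiter 1", i.e. two data
bricks plus one arbiter, which is what this report calls replica 2 + arbiter 1)
should be checked against the exact 3.7.x release in use:

# placeholders: node1/node2/arbiter, /bricks/brick1, testvol, eth1
gluster volume create testvol replica 3 arbiter 1 \
    node1:/bricks/brick1 node2:/bricks/brick1 arbiter:/bricks/brick1
gluster volume start testvol
mkdir -p /mnt/testvol
mount -t glusterfs node1:/testvol /mnt/testvol
# on the client: generate write load
dd if=/dev/zero of=/mnt/testvol/testfile bs=1M count=200 oflag=direct
# on the arbiter node, in parallel: watch the interface bandwidth,
# e.g. with iftop -i eth1 or any similar bandwidth monitor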
Actual results:
The client is sending IO writes to the arbiter node, decreasing the expected
performance.
Expected results:
No heavy IO should be going to the arbiter node, as it has no reason to receive
file data when its brick only stores metadata. This considerably slows client
IO, since the client is writing to three nodes instead of two. I would assume
this is the same performance penalty that replica 3 has compared to replica 2.
Additional info:
During a disk migration from an NFS storage domain to a gluster storage domain,
the arbiter node interface shows 37GB of data received:
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.231.62 netmask 255.255.255.0 broadcast 10.0.231.255
inet6 fe80::5054:ff:fe61:a934 prefixlen 64 scopeid 0x20<link>
ether 52:54:00:61:a9:34 txqueuelen 1000 (Ethernet)
RX packets 5874053 bytes 39820122925 (37.0 GiB)
RX errors 0 dropped 650 overruns 0 frame 0
TX packets 4793230 bytes 4387154708 (4.0 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
But the arbiter node has very little storage space available to it (the 'brick'
mount point is on /):
# df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/vda3       8.6G  1.2G  7.4G   14%   /
devtmpfs        912M  0     912M   0%    /dev
tmpfs           921M  0     921M   0%    /dev/shm
tmpfs           921M  8.4M  912M   1%    /run
tmpfs           921M  0     921M   0%    /sys/fs/cgroup
/dev/vda1       497M  157M  341M   32%   /boot
--- Additional comment from Steve D on 2015-08-26 16:42:56 EDT ---
Furthermore, if I do some write tests to the fuse mount point for the volume I
get the following:
Active nodes: node1, node2, arbiter:
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
^C34+0 records in
34+0 records out
35651584 bytes (36 MB) copied, 33.394 s, 1.1 MB/s
Active nodes: node1, node2:
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 3.19871 s, 65.6 MB/s
Active nodes: node2, arbiter
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 5.86619 s, 35.7 MB/s
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 7.74412 s, 27.1 MB/s
# dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 6.77487 s, 31.0 MB/s
--- Additional comment from Ravishankar N on 2015-08-27 05:21:44 EDT ---
Hi Steve,
1. You're correct in observing that we send the writes even to the brick
process on the arbiter node (even though the data is not written to disk
there). But the write does need to be sent to that brick for things to work
correctly; the AFR changelog xattrs depend on it (see the getfattr sketch after
this comment). What we could do is send only one byte to the arbiter instead of
the entire data. I'll work on the fix.
2. FWIW, I'm not able to recreate the drastic difference in 'dd' throughputs
(1.1 vs 65.6 MB/s?) described in comment #1. I notice only a marginal
difference in my test setup. Can you check whether you're hitting the same
behaviour on a normal replica 3 volume? (A comparison sequence is sketched
below.)
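For anyone following along, the changelog bookkeeping referred to in point 1
can be inspected directly on any brick; a rough sketch, with a placeholder
brick path and file name:

# show the AFR changelog xattrs for a file as stored on a brick
getfattr -d -m . -e hex /bricks/brick1/testfile
# the output may include trusted.afr.dirty and
# trusted.afr.<volname>-client-N entries, which AFR uses to track
# in-flight and pending writes; this is why the write must also be
# wound to the arbiter brick

And a quick way to run the plain replica 3 comparison asked for in point 2
(hosts, bricks and the volume name are again placeholders):

gluster volume create testvol-r3 replica 3 \
    node1:/bricks/brick2 node2:/bricks/brick2 node3:/bricks/brick2
gluster volume start testvol-r3
mkdir -p /mnt/testvol-r3
mount -t glusterfs node1:/testvol-r3 /mnt/testvol-r3
dd if=/dev/zero of=/mnt/testvol-r3/testfile bs=1M count=200 oflag=direct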
--- Additional comment from Steve D on 2015-09-02 15:53:37 EDT ---
Sorry it took a while to get back to this. My arbiter node was a VM before; I
decided to get another physical host running and added some storage to it.
I should also mention that all my bricks are SSDs and the network is 10GbE.
I added a 3rd disk and created a replica 3 volume with no arbiter:
Fuse mount point:
dd if=/dev/zero of=test10 bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.360235 s, 582 MB/s
I then rebuilt the volume with replica 3 arbiter 1:
Fuse mount point:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.75525 s, 76.1 MB/s
Obviously there was some performance impact from the virtualized environment,
although I have no idea how large it would be. But in either case there is a
significant gap here.
Also, I'm noticing that an oVirt VM backed by the replica 3 arbiter 1 volume
gets significantly lower write speed:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 16.5165 s, 12.7 MB/s
In replica 3 no arbiter, the VM gets:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.392785 s, 534 MB/s
--- Additional comment from Steve D on 2015-09-02 16:15:25 EDT ---
Forgot to mention: when the arbiter node is down, throughput inside a VM is
very similar to that at the fuse mount point:
VM:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.80659 s, 74.7 MB/s
fuse:
dd if=/dev/zero of=test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.55501 s, 82.1 MB/s
--- Additional comment from Vijay Bellur on 2015-09-03 00:28:14 EDT ---
REVIEW: http://review.gluster.org/12095 (afr: Do not wind the full writev
payload to arbiter brick) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1255110
[Bug 1255110] client is sending io to arbiter with replica 2