[Bugs] [Bug 1408770] New: [Arbiter] After Killing a brick writes drastically slow down
bugzilla at redhat.com
Tue Dec 27 06:39:28 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1408770
Bug ID: 1408770
Summary: [Arbiter] After Killing a brick writes drastically
slow down
Product: GlusterFS
Version: 3.9
Component: arbiter
Severity: high
Assignee: bugs at gluster.org
Reporter: ravishankar at redhat.com
CC: bugs at gluster.org
Depends On: 1408112, 1408395
+++ This bug was initially created as a clone of Bug #1408395 +++
+++ This bug was initially created as a clone of Bug #1408112 +++
Description of problem:
When both data bricks are up, writes run at optimal speed. After a data
brick is killed, writes slow down drastically.
Version-Release number of selected component (if applicable):
Gluster version:- 3.8.4-9
How reproducible:
100%
Logs and Volume profiles are placed at
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>
Steps to Reproduce:
1. For comparison, create a 1*(2+1) arbiter volume.
2. Write 2 GB of data using fio with the command below:
fio /randomwritejob.ini --client=/clients.list
3. Kill a data brick, then write the same data again using fio.
Actual results:
Writing 2 GB of data takes a very long time to complete.
Expected results:
There should be no difference in writing the same data in both scenarios.
Additional info:
[root at dhcp46-206 /]# vim /randomwritejob.ini
[root at dhcp46-206 /]# cat /randomwritejob.ini
[global]
rw=randrw
io_size=1g
fsync_on_close=1
size=1g
bs=64k
rwmixread=20
openfiles=1
startdelay=0
ioengine=sync
verify=md5
[write]
directory=/mnt/samsung
nrfiles=1
filename_format=f.$jobnum.$filenum
numjobs=2
[root at dhcp46-206 /]#
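As a rough cross-check of the "2 GB" figure, the sketch below estimates how much I/O this job file requests. It is an illustration, not part of the original report: the helper names (parse_size, io_volume) are hypothetical, only the options relevant to the estimate are copied from the job file, and it assumes fio's default base-1024 size suffixes and that rwmixread=20 means roughly 20% of the I/O is reads.

```python
import configparser

# Relevant subset of /randomwritejob.ini from the report.
FIO_JOB = """\
[global]
rw=randrw
io_size=1g
size=1g
bs=64k
rwmixread=20
[write]
nrfiles=1
numjobs=2
"""

# Assumption: fio's default kb_base=1024, so 'k', 'm', 'g' are powers of 1024.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a fio-style size such as '64k' or '1g' to bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def io_volume(job_text, section="write"):
    """Return (total_bytes, approx_write_bytes) requested by one job section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(job_text)
    # Job-section options override the [global] ones.
    opts = dict(parser["global"])
    opts.update(parser[section])
    total = parse_size(opts["io_size"]) * int(opts["numjobs"])
    read_pct = int(opts["rwmixread"])
    writes = total * (100 - read_pct) // 100
    return total, writes


if __name__ == "__main__":
    total, writes = io_volume(FIO_JOB)
    print(f"total I/O requested: {total / 1024**3:.2f} GiB")
    print(f"approximate write share: {writes / 1024**3:.2f} GiB")
```

Under these assumptions the two jobs request 2 GiB of I/O in total, of which about 80% is writes. The verify=md5 option adds verification reads that this estimate ignores.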
--- Additional comment from Karan Sandha on 2016-12-23 02:43:36 EST ---
Ran the above test steps on Replica 2 and Replica 3 volumes. The issue
appears to be specific to arbiter.
Thanks & Regards
Karan Sandha
--- Additional comment from Ravishankar N on 2016-12-23 04:21:45 EST ---
RCA:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are triggered repeatedly. These heals open an fd, causing eager lock to be
disabled (open fd count > 1) in afr transactions, leading to extra LOCK +
FXATTROPS, which reduces throughput.
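The reasoning above can be illustrated with a toy model. This is a deliberately simplified sketch, not the afr code: the Brick class and all function names are hypothetical, and the model covers only the heal decision taken after an inode refresh, omitting every other condition the real afr_replies_interpret() checks.

```python
from dataclasses import dataclass


@dataclass
class Brick:
    name: str
    up: bool
    readable: bool  # may this brick serve data reads?
    accused: bool   # do pending xattrs on other bricks blame this brick?


def heal_by_readable(bricks):
    """Simplified old heuristic: heal if any up brick is not readable."""
    return any(b.up and not b.readable for b in bricks)


def heal_by_accused(bricks):
    """Simplified new heuristic: heal only if an up brick is accused."""
    return any(b.up and b.accused for b in bricks)


def eager_lock_allowed(open_fd_count):
    """Per the RCA, eager lock is disabled once open fd count exceeds 1."""
    return open_fd_count <= 1


# 1*(2+1) arbiter volume with one data brick killed.
volume = [
    Brick("data-0", up=True, readable=True, accused=False),
    Brick("data-1", up=False, readable=False, accused=True),
    # The arbiter is never readable for data, even though it is healthy.
    Brick("arbiter", up=True, readable=False, accused=False),
]

if __name__ == "__main__":
    print("readable-based heal triggered:", heal_by_readable(volume))
    print("accused-based heal triggered:", heal_by_accused(volume))
    # A spurious heal opens a second fd on the file being written.
    print("eager lock with heal fd open:", eager_lock_allowed(2))
```

In this model the readable-based check fires because the healthy arbiter is never readable, while the accused-based check does not fire because the only accused brick is down and cannot be healed yet. The extra fd opened by each spurious heal is what pushes the open fd count above 1 and disables eager lock.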
--- Additional comment from Worker Ant on 2016-12-23 04:36:42 EST ---
REVIEW: http://review.gluster.org/16277 (afr: use accused matrix instead of
readable matrix for deciding heals) posted (#1) for review on master by
Ravishankar N (ravishankar at redhat.com)
--- Additional comment from Worker Ant on 2016-12-27 01:34:05 EST ---
COMMIT: http://review.gluster.org/16277 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com)
------
commit 5a7c86e578f5bbd793126a035c30e6b052177a9f
Author: Ravishankar N <ravishankar at redhat.com>
Date: Fri Dec 23 07:11:13 2016 +0000
afr: use accused matrix instead of readable matrix for deciding heals
Problem:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are triggered. These heals open an fd, causing eager lock to be
disabled (open fd count > 1) in afr transactions, leading to extra FXATTROPS.
Fix:
Use the accused matrix (derived from interpreting the afr pending
xattrs) to decide whether we can start heal or not.
Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
BUG: 1408395
Signed-off-by: Ravishankar N <ravishankar at redhat.com>
Reviewed-on: http://review.gluster.org/16277
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Smoke: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
Tested-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1408112
[Bug 1408112] [Arbiter] After Killing a brick writes drastically slow down
https://bugzilla.redhat.com/show_bug.cgi?id=1408395
[Bug 1408395] [Arbiter] After Killing a brick writes drastically slow down