[Bugs] [Bug 1408820] New: [Arbiter] After Killing a brick writes drastically slow down

bugzilla at redhat.com bugzilla at redhat.com
Tue Dec 27 12:45:55 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1408820

            Bug ID: 1408820
           Summary: [Arbiter] After Killing a brick writes drastically
                    slow down
           Product: GlusterFS
           Version: 3.7.18
         Component: arbiter
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org
        Depends On: 1408112, 1408395
            Blocks: 1408770, 1408772



+++ This bug was initially created as a clone of Bug #1408395 +++

+++ This bug was initially created as a clone of Bug #1408112 +++

Description of problem:
With both data bricks up, writes run at the expected speed. After one data
brick is killed, writes slow down drastically.

Version-Release number of selected component (if applicable):
Gluster version:- 3.8.4-9

How reproducible:
100%
Logs and Volume profiles are placed at 
 rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

Steps to Reproduce:
1. Create a 1*(2+1) arbiter volume to use as the baseline.
2. Write 2 GB of data with fio, using the command below:
    fio /randomwritejob.ini  --client=/clients.list
3. Kill a data brick, then write the same data again with fio.
   Writing the 2 GB of data now takes a very long time to complete.

Expected results:
There should be no difference in the time taken to write the same data in
the two scenarios.

Additional info:
[root at dhcp46-206 /]# vim /randomwritejob.ini
[root at dhcp46-206 /]# cat /randomwritejob.ini
[global]
rw=randrw
io_size=1g
fsync_on_close=1
size=1g
bs=64k
rwmixread=20
openfiles=1
startdelay=0
ioengine=sync
verify=md5
[write]
directory=/mnt/samsung
nrfiles=1
filename_format=f.$jobnum.$filenum
numjobs=2
[root at dhcp46-206 /]#
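
For reference, the I/O volume implied by the job file above can be worked out
by hand. The sketch below is only an illustration: it assumes fio's default
1024-based size suffixes, treats rwmixread as an exact split (fio picks
reads and writes randomly, so real counts will vary), and the helper name
parse_size is made up for this example.

```python
# Rough arithmetic for the job file above (illustrative, not fio output).

def parse_size(text):
    """Convert a size such as '64k' or '1g' to bytes, assuming 1024-based units."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)

# Values copied from /randomwritejob.ini
job = {"size": "1g", "bs": "64k", "rwmixread": 20, "numjobs": 2}

size_bytes = parse_size(job["size"])
block_bytes = parse_size(job["bs"])

# Each job issues size / bs I/O requests.
ios_per_job = size_bytes // block_bytes

# Two jobs of 1g each give the "2 gigs" mentioned in the steps.
total_bytes = size_bytes * job["numjobs"]

# With rwmixread=20, roughly 80% of the requests are writes.
expected_write_ios = ios_per_job * (100 - job["rwmixread"]) // 100

print(ios_per_job, total_bytes, expected_write_ios)
```

Each of those write requests is a separate AFR transaction on the mount, so
any fixed per-write overhead is multiplied by a count of this order.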


--- Additional comment from Karan Sandha on 2016-12-23 02:43:36 EST ---

Ran the same test steps on Replica 2 and Replica 3 volumes. The slowdown did
not show up there, so this issue appears to be specific to arbiter.

Thanks & Regards
Karan Sandha

--- Additional comment from Ravishankar N on 2016-12-23 04:21:45 EST ---

RCA:
afr_replies_interpret() used the 'readable' matrix to trigger client
side heals after inode refresh. But for arbiter, readable is always
zero. So when `dd` is run with a data brick down, spurious data heals
are triggered repeatedly. These heals open an fd, which disables eager lock
in afr transactions (open fd count >1). Each write then needs extra LOCK +
FXATTROP operations, which reduces throughput.
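
To give a feel for why losing eager lock hurts, here is a toy cost model. It
is not GlusterFS code: the function name wire_ops is hypothetical, and the
assumption that each phase of a write transaction (lock, pre-op xattrop,
write, post-op xattrop, unlock) costs exactly one round trip, and that eager
lock amortises all non-write phases over the whole run, is a deliberate
simplification.

```python
# Toy model of per-write overhead with and without eager lock.
# Assumption: every phase costs one network round trip per brick set.

def wire_ops(num_writes, eager_lock):
    """Count round trips for num_writes consecutive writes to one file."""
    lock, pre_op, write, post_op, unlock = 1, 1, 1, 1, 1
    if eager_lock:
        # Lock and xattr bookkeeping are paid once and shared by all writes.
        return lock + pre_op + num_writes * write + post_op + unlock
    # Without eager lock every write is a full, independent transaction.
    return num_writes * (lock + pre_op + write + post_op + unlock)

with_eager = wire_ops(1000, eager_lock=True)
without_eager = wire_ops(1000, eager_lock=False)
print(with_eager, without_eager)
```

Under these assumptions the run without eager lock issues about five times
as many operations, which is the kind of gap that would show up as a drastic
slowdown for small sequential writes.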

--- Additional comment from Worker Ant on 2016-12-23 04:36:42 EST ---

REVIEW: http://review.gluster.org/16277 (afr: use accused matrix instead of
readable matrix for deciding heals) posted (#1) for review on master by
Ravishankar N (ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-12-27 01:34:05 EST ---

COMMIT: http://review.gluster.org/16277 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 5a7c86e578f5bbd793126a035c30e6b052177a9f
Author: Ravishankar N <ravishankar at redhat.com>
Date:   Fri Dec 23 07:11:13 2016 +0000

    afr: use accused matrix instead of readable matrix for deciding heals

    Problem:
    afr_replies_interpret() used the 'readable' matrix to trigger client
    side heals after inode refresh. But for arbiter, readable is always
    zero. So when `dd` is run with a data brick down, spurious data heals
    are triggered. These heals open an fd, causing eager lock to be
    disabled (open fd count >1) in afr transactions, leading to extra FXATTROPS.

    Fix:
    Use the accused matrix (derived from interpreting the afr pending
    xattrs) to decide whether we can start heal or not.

    Change-Id: Ibbd56c9aed6026de6ec42422e60293702aaf55f9
    BUG: 1408395
    Signed-off-by: Ravishankar N <ravishankar at redhat.com>
    Reviewed-on: http://review.gluster.org/16277
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    Tested-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
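
The difference between the two heal checks described in the commit message
can be sketched with a small model. This is a simplification, not the code
from the patch: the Brick class, its fields and both function names are
hypothetical, and the rule "only bricks that are up are considered for
heal" is an assumption made for the example.

```python
# Simplified model of the heal decision before and after the fix.
from dataclasses import dataclass

@dataclass
class Brick:
    up: bool          # brick is reachable from the client
    is_arbiter: bool  # arbiter holds no file data
    accused: bool     # another brick's pending xattrs blame this brick

def data_readable(brick):
    # Per the RCA, the arbiter is never marked readable for data.
    return brick.up and not brick.is_arbiter and not brick.accused

def heal_by_readable(bricks):
    # Old check (as modelled here): any up brick that is not readable.
    return any(b.up and not data_readable(b) for b in bricks)

def heal_by_accused(bricks):
    # New check (as modelled here): any up brick that is actually accused.
    return any(b.up and b.accused for b in bricks)

# 1*(2+1) volume with one data brick killed during writes.
bricks = [
    Brick(up=True, is_arbiter=False, accused=False),   # surviving data brick
    Brick(up=False, is_arbiter=False, accused=True),   # killed data brick
    Brick(up=True, is_arbiter=True, accused=False),    # arbiter
]

print(heal_by_readable(bricks), heal_by_accused(bricks))
```

In this model the readable-based check asks for a heal only because the
arbiter is never readable, while the accused-based check finds nothing to
heal until the killed brick comes back.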


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1408112
[Bug 1408112] [Arbiter] After Killing a brick writes drastically slow down
https://bugzilla.redhat.com/show_bug.cgi?id=1408395
[Bug 1408395] [Arbiter] After Killing a brick writes drastically slow down
https://bugzilla.redhat.com/show_bug.cgi?id=1408770
[Bug 1408770] [Arbiter] After Killing a brick writes drastically slow down
https://bugzilla.redhat.com/show_bug.cgi?id=1408772
[Bug 1408772] [Arbiter] After Killing a brick writes drastically slow down

