[Bugs] [Bug 1229914] New: glusterfs self heal takes too long following node outage

bugzilla at redhat.com
Tue Jun 9 22:52:21 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1229914

            Bug ID: 1229914
           Summary: glusterfs self heal takes too long following node
                    outage
           Product: GlusterFS
           Version: 3.7.1
         Component: core
          Severity: medium
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: pcuzner at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:
Using glusterfs 3.7.1 on RHEL 7.1, with oVirt providing the virt layer. Taking a
node down for maintenance means the bulk of the active VMs get marked for self
heal once the node comes back.

Current mechanisms for self heal (full and diff) address the whole vdisk.

In a managed test, I had only 10 VMs active and introduced ~34 GB of change to
the VMs to measure the time taken to return the environment to a consistent
state.

It took nearly 2 hours before the self heal was complete - and this was
dedicated time, i.e. without further VM load/changes.

2 hours for 34 GB of change is too long for admins to wait, and it represents a
large window of opportunity for a further node to be lost, sending the data into
split brain.

As I understand it, all VMs will change during the outage, which means that
regardless of the amount of data change injected into the system, self heal has
to look at each and every vdisk. For example, in my test I used a dedicated data
disk for the changes - but the self heal list showed that both the data disks
and the OS disks needed healing (a sketch for inspecting this list follows).
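
For reference, this is how I have been enumerating what is queued for heal - a
minimal sketch that shells out to the standard "gluster volume heal <vol> info"
command. The volume name ("vmstore") and the exact "Brick ..." /
"Number of entries: N" output layout it parses are assumptions about my setup,
not part of this report.

#!/usr/bin/env python
# Sketch: count entries pending self heal per brick by parsing
# "gluster volume heal <vol> info". The volume name ("vmstore") and the
# "Brick ..." / "Number of entries: N" output layout are assumptions.
import subprocess

VOLUME = "vmstore"  # hypothetical volume name

out = subprocess.check_output(
    ["gluster", "volume", "heal", VOLUME, "info"]).decode()

brick, pending = None, {}
for line in out.splitlines():
    if line.startswith("Brick "):
        brick = line[len("Brick "):].strip()
    elif line.startswith("Number of entries:") and brick:
        pending[brick] = int(line.split(":", 1)[1])

for brick, count in sorted(pending.items()):
    print("%-50s %d entries to heal" % (brick, count))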


Version-Release number of selected component (if applicable):
glusterfs 3.7.1
oVirt 3.5
RHEL 7.1

How reproducible:
every time

Steps to Reproduce:
1. establish a virt environment with 10-30 VMs (full copies, not clones)
2. take one gluster/ovirt node down
3. add data to a number of the VMs
4. bring the node back online
5. record the time taken for the self heal to complete (a timing sketch follows
this list)
6. note the CPU consumption during the diff heal and the impact on running VMs
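
For step 5, this is the kind of polling loop I have been using to time the heal
- a sketch only; it assumes the same "Number of entries:" lines in the heal info
output and uses a hypothetical volume name.

#!/usr/bin/env python
# Sketch: poll "gluster volume heal <vol> info" after the node returns and
# report how long it takes for every brick to reach zero pending entries.
# The volume name and the "Number of entries: N" output format are assumptions.
import re, subprocess, time

VOLUME = "vmstore"   # hypothetical volume name
POLL_SECONDS = 30

def entries_pending(volume):
    out = subprocess.check_output(
        ["gluster", "volume", "heal", volume, "info"]).decode()
    return sum(int(n) for n in re.findall(r"Number of entries:\s*(\d+)", out))

start = time.time()
while True:
    remaining = entries_pending(VOLUME)
    print("t=%6.0fs  entries pending: %d" % (time.time() - start, remaining))
    if remaining == 0:
        break
    time.sleep(POLL_SECONDS)

print("self heal completed in %.1f minutes" % ((time.time() - start) / 60.0))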

Actual results:
The environment had 34 GB of change and returned to full redundancy in 2 hours.

Problems
1. Even though this doesn't fully reflect the work undertaken, viewed
simplistically a 2-hour recovery for 34 GB of change is < 5 MB/s (the
calculation is shown after this list).
2. CPU consumption of the self heal diff is high and reduces the CPU available
to running VMs. Most of the CPU consumed during self heal is within the
glusterfsd process, not the self-heal daemon (glustershd) - and it is largely
kernel time (sys%) rather than usr, which impacts CPU availability to the VMs.
3. During self heal, even with CPU available, vdisks being healed send the
owning VM into a non-responsive state ('?' symbol).
4. If the VM running the oVirt manager goes non-responsive, you lose management
of the cluster.
5. Waiting hours before returning to full redundancy is problematic for
administration and maintenance of an oVirt/gluster environment.
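
For reference, the arithmetic behind the < 5 MB/s figure in problem 1, using the
rounded numbers from the test above (34 GB taken as 34 x 1024 MB):

# Back-of-the-envelope behind problem 1 (rounded figures from the test above).
changed_mb = 34 * 1024        # ~34 GB of change
elapsed_s = 2 * 60 * 60       # ~2 hours to return to full redundancy
print("%.1f MB/s effective heal rate" % (changed_mb / float(elapsed_s)))  # ~4.8 MB/s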


Expected results:
1. Full redundancy should be restored within half an hour as a goal - or a
specific recovery rate should be offered (i.e. XX MB/s, with a sensible
default); the rate implied by this goal is worked out after this list.
2. The amount of change (self heal work) should be visible to the admin - worst
case at the CLI, best case in the oVirt UI (i.e. XX GB to heal, YY GB done, plus
the files affected).
3. Self heal should not constrain the CPU available to running VMs.
4. VMs should never go into a non-responsive state due to self heal.
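
For expected result 1, the half-hour goal implies roughly a 4x improvement over
the rate observed above - a quick back-of-the-envelope with the same 34 GB
figure:

# Rate implied by the half-hour recovery goal for the same 34 GB of change.
changed_mb = 34 * 1024
target_s = 30 * 60
print("target rate: %.1f MB/s" % (changed_mb / float(target_s)))  # ~19.3 MB/s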



Additional info:
