[Bugs] [Bug 1285634] New: Self-heal triggered every couple of seconds and a 3-node 1-arbiter setup
bugzilla at redhat.com
Thu Nov 26 06:08:51 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1285634
Bug ID: 1285634
Summary: Self-heal triggered every couple of seconds and a
3-node 1-arbiter setup
Product: GlusterFS
Version: mainline
Component: unclassified
Keywords: Triaged
Severity: medium
Priority: high
Assignee: bugs at gluster.org
Reporter: ravishankar at redhat.com
CC: adrian.gruntkowski at gmail.com, bugs at gluster.org,
gluster-bugs at redhat.com, mselvaga at redhat.com,
pkarampu at redhat.com, ravishankar at redhat.com
Depends On: 1283956
+++ This bug was initially created as a clone of Bug #1283956 +++
Description of problem:
I have a 3 node setup with 1 arbiter brick for every volume. Every volume
contains a couple of big files with KVM disk images.
Every couple of minutes/seconds (probably depending on activity), a self-heal
operation is triggered on one or more files on the volumes. During that time,
there is no noticeable loss of connectivity or anything like that.
How reproducible:
Run "gluster volume heal volume-name info" a couple of times and observe the
output.
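The check above can be scripted so the transient entries are easy to catch; a
minimal sketch, assuming the volume name system_www1 from the output below and
a shell on any cluster node:

```shell
# Poll heal-info every 2 seconds and count files reported as
# "Possibly undergoing heal". The volume name is an example.
while true; do
    gluster volume heal system_www1 info |
        grep -c "Possibly undergoing heal"
    sleep 2
done
```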
Actual results:
root at web-vm:~# gluster volume heal system_www1 info
Brick cluster-rep:/GFS/system/www1
/images/101/vm-101-disk-1.qcow2 - Possibly undergoing heal
Number of entries: 1
Brick web-rep:/GFS/system/www1
/images/101/vm-101-disk-1.qcow2 - Possibly undergoing heal
Number of entries: 1
Brick mail-rep:/GFS/system/www1
/images/101/vm-101-disk-1.qcow2 - Possibly undergoing heal
Number of entries: 1
Expected results:
Heal is not triggered without a reason.
Additional info:
Setting "cluster.self-heal-daemon" to "off" on the volumes does not change the
behavior.
--- Additional comment from Adrian Gruntkowski on 2015-11-20 06:41 EST ---
--- Additional comment from Ravishankar N on 2015-11-20 07:07:51 EST ---
Had a quick look at one of the mount logs for the 'system_www1' volume, i.e.
glusterfs_cluster-vm/mnt-pve-system_www1.log.1, where I do see disconnects to
the bricks.
#grep -rne "disconnected from" mnt-pve-system_www1.log.1|tail -n3
2177:[2015-11-19 15:58:32.687248] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-system_www1-client-0: disconnected from
system_www1-client-0. Client process will keep trying to connect to glusterd
until brick's port is available
2283:[2015-11-19 15:58:43.486658] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-system_www1-client-0: disconnected from
system_www1-client-0. Client process will keep trying to connect to glusterd
until brick's port is available
2385:[2015-11-19 15:58:43.557338] I [MSGID: 114018]
[client.c:2042:client_rpc_notify] 0-system_www1-client-2: disconnected from
system_www1-client-2. Client process will keep trying to connect to glusterd
until brick's port is available
So it appears that there are network disconnects from the mount to the bricks.
If I/O was happening during the disconnects, self-heal will get triggered when
the connection is re-established.
Adrian, could you confirm, as you mentioned on IRC, whether this could be an
issue with your firewall/network? If yes, I'll close this as NOTABUG.
--- Additional comment from Ravishankar N on 2015-11-20 07:11:14 EST ---
(In reply to Adrian Gruntkowski from comment #0)
> Setting "cluster.self-heal-daemon" to "off" on the volumes does not change
> the behavior.
Clients (mounts) can also trigger self-heals in addition to the self-heal
daemon. If you want to disable client side heal, you need to set
cluster.metadata-self-heal, cluster.data-self-heal and cluster.entry-self-heal
to off.
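All three options can be set from any cluster node; a sketch, assuming the
volume name system_www1 from this report:

```shell
# Disable client-side (mount-side) self-heals only; the self-heal
# daemon is controlled separately via cluster.self-heal-daemon.
for opt in cluster.metadata-self-heal \
           cluster.data-self-heal \
           cluster.entry-self-heal; do
    gluster volume set system_www1 "$opt" off
done
```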
--- Additional comment from Adrian Gruntkowski on 2015-11-20 08:06:01 EST ---
The entries that you mentioned have timestamps from yesterday. I did a couple
of server restarts and was fiddling with applying the patch and so forth. The
logs look clean for today in that regard.
I have double checked the logs for interface flapping and firewall rules but
everything seems fine. The pings on the interfaces dedicated to gluster between
the nodes go through without any losses.
Ravishankar: Sure, I was changing that setting in the course of the experiment
that Pranith wanted me to do. I just mentioned it for completeness.
--- Additional comment from Pranith Kumar K on 2015-11-24 09:22:00 EST ---
hi Adrian,
I looked at the pcap files and found nothing unusual. So I think we are
left with trying to re-create the problem. Do you think we can come up with a
way to recreate this problem consistently?
Pranith
--- Additional comment from Adrian Gruntkowski on 2015-11-24 09:30:06 EST ---
My setup is pretty basic, save for crossover configuration of 2 sets of
volumes.
I have actually laid it out in the initial post on ML about the issue:
http://www.gluster.org/pipermail/gluster-users/2015-October/024078.html
For the time being, I'm rolling back to a 2-node setup. I will also try to set
up a cluster with an arbiter in a local test environment on VirtualBox-based VMs.
Adrian
--- Additional comment from Pranith Kumar K on 2015-11-24 11:11:32 EST ---
Adrian,
So you don't see this without Arbiter?
Pranith
--- Additional comment from Adrian Gruntkowski on 2015-11-25 04:00:23 EST ---
Yes, I only see it in an arbiter setup.
Adrian
--- Additional comment from Vijay Bellur on 2015-11-26 00:01:57 EST ---
REVIEW: http://review.gluster.org/12755 (cluster/afr: change data self-heal
size check for arbiter) posted (#1) for review on master by Pranith Kumar
Karampuri (pkarampu at redhat.com)
--- Additional comment from Pranith Kumar K on 2015-11-26 00:02:43 EST ---
Adrian,
I was able to recreate this issue.
Steps to recreate:
1) Create a volume with arbiter, start the volume and mount the volume
2) On the mount point execute "dd if=/dev/zero of=a.txt"
3) While the command above is running, execute "gluster volume heal <volname>
info" in a loop. We will see pending entries to be healed.
With the patch in https://bugzilla.redhat.com/show_bug.cgi?id=1283956#c9
I don't see the issue anymore. Let me know how your testing goes with this
patch.
Pranith
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1283956
[Bug 1283956] Self-heal triggered every couple of seconds and a 3-node
1-arbiter setup