[Bugs] [Bug 1153629] New: AFR : excessive logging of "Non blocking entrylks failed" in glfsheal log file.
bugzilla at redhat.com
bugzilla at redhat.com
Thu Oct 16 11:25:28 UTC 2014
https://bugzilla.redhat.com/show_bug.cgi?id=1153629
Bug ID: 1153629
Summary: AFR : excessive logging of "Non blocking entrylks
failed" in glfsheal log file.
Product: GlusterFS
Version: 3.5.3
Component: replicate
Severity: high
Priority: high
Assignee: kdhananj at redhat.com
Reporter: kdhananj at redhat.com
CC: bugs at gluster.org, jbautist at redhat.com,
kdhananj at redhat.com, nlevinki at redhat.com,
spandura at redhat.com, storage-qa-internal at redhat.com,
vbellur at redhat.com
Depends On: 1055489
+++ This bug was initially created as a clone of Bug #1055489 +++
Description of problem:
=======================
The following info message is seen excessively in the glfsheal log file.
[2014-01-20 10:48:48.153464] I
[afr-self-heal-common.c:2098:afr_sh_post_nb_entrylk_missing_entry_sh_cbk]
0-vol-replicate-0: Non blocking entrylks failed.
This INFO message is of no use. This doesn't even provide information on what
directory the "Non blocking entrylks failed" .
Number of messages in the log file : 4169241 and the log file is still growing
with this message.
root at rhs-client12 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed"
/var/log/glusterfs/glfsheal-vol.log | wc -l
4169241
root at rhs-client12 [Jan-20-2014-11:28:33] >
Version-Release number of selected component (if applicable):
=========================================================
glusterfs 3.4.0.57rhs built on Jan 13 2014 06:59:05
How reproducible:
Steps to Reproduce:
======================
1. Create a 2 x 3 distribute-replicate volume. ( 3 storage nodes with 2 bricks
on each node ). Start the volume. Set data-self-heal, metadata-self-heal,
entry-self-heal to "off".
2. Bring down brick2 and brick5 (from node2).
3. Create a fuse mount. Create files and directories ( had around 45,00,000
files in each sub-volume to self-heal )
4. Brought back the brick2 and brick5 online (service glusterd restart)
5. Executed "gluster volume heal <volume_name> info" from all the nodes
Actual results:
===================
Node1:-
--------------
root at rhs-client11 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed"
/var/log/glusterfs/glfsheal-vol.log | wc -l
2887547
Node2 :-
-----------
root at rhs-client12 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed"
/var/log/glusterfs/glfsheal-vol.log | wc -l
4169241
Node3 :-
------------
root at rhs-client13 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed"
/var/log/glusterfs/glfsheal-vol.log | wc -l
1421132
Expected results:
===============
Additional info:
===================
root at rhs-client11 [Jan-20-2014-11:28:27] >gluster v info
Volume Name: vol
Type: Distributed-Replicate
Volume ID: db9d926b-63b1-448a-8d8c-9e4a9756a773
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: rhs-client11:/rhs/bricks/b1
Brick2: rhs-client12:/rhs/bricks/b1-rep1
Brick3: rhs-client13:/rhs/bricks/b1-rep2
Brick4: rhs-client11:/rhs/bricks/b2
Brick5: rhs-client12:/rhs/bricks/b2-rep1
Brick6: rhs-client13:/rhs/bricks/b2-rep2
Options Reconfigured:
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off
Number of files to self-heal :
===============================
brick1:-
=======
root at rhs-client11 [Jan-20-2014- 3:45:26] >ls -l
/rhs/bricks/b2/.glusterfs/indices/xattrop/ | wc
4593810 41344283 372098553
brick3:
========
root at rhs-client13 [Jan-20-2014- 7:49:23] >ls -l
/rhs/bricks/b1-rep2/.glusterfs/indices/xattrop/ | wc
4551006 40959047 368631429
brick4:-
==========
root at rhs-client11 [Jan-20-2014-10:59:54] >find
/rhs/bricks/b2/.glusterfs/indices/xattrop/ | wc -l
4587245
brick6:-
=========
root at rhs-client13 [Jan-20-2014-10:59:54] >find
/rhs/bricks/b2-rep2/.glusterfs/indices/xattrop/ | wc -l
4587390
--- Additional comment from Krutika Dhananjay on 2014-10-13 03:50:18 EDT ---
Preliminary analysis from the bug Description:
---------------------------------------------
1) Heal info was executed from all nodes in the cluster. And glfsheal program
in turn takes locks in both sh-domain and xlator domain before determining if a
file needs healing. So it is quite likely that the glfsheal process on one of
the nodes grabs the locks while the ones running on other nodes attempt to hold
the same lock at the same time and fail.
2) The steps involved creating data on the mount while some of the bricks were
down. This means that after these nodes are brought back up, index selfheal
kicks in, which will again take locks of the same nature as the glfsheal
processes. This would mean there is lock contention between:
a) glfsheal process on node x v/s glfsheal process on node y (x != y)
b) selfheal daemon vs glfsheal processes
Will run a stripped down version of the test case (as simple as only 1 heal
info execution without any selfheal in progress) with 3.0.0, examine logs, then
add more participants into the test case, one at a time and post the results.
--
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug https://bugzilla.redhat.com/token.cgi?t=BHaPi7Hxh1&a=cc_unsubscribe
More information about the Bugs
mailing list