[Gluster-devel] Regression failure of tests/basic/afr/data-self-heal.t
Ravishankar N
ravishankar at redhat.com
Wed May 6 04:26:09 UTC 2015
TL;DR: Need to come up with a fix for AFR data self-heal from clients
(mounts).
/data-self-heal.t/ creates a 1x2 volume, sets AFR changelog xattrs
directly on the files on the backend bricks, and then runs a full heal
to heal the files.
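
For reference, a rough sketch of that flow using the plain CLI rather
than the .t test helpers; the volume name, brick paths and xattr value
below are made up for illustration and are not copied from the script:

    # Create and mount a 1x2 (replica 2) volume.
    gluster volume create testvol replica 2 \
        $HOST:/bricks/b0 $HOST:/bricks/b1 force
    gluster volume start testvol
    mount -t glusterfs $HOST:/testvol /mnt/testvol
    echo hello > /mnt/testvol/file1

    # Pretend brick b1 missed a write: bump the data part of the AFR
    # changelog xattr on brick b0 against client-1, directly on the
    # backend (representative xattr name/value, not the test's exact ones).
    setfattr -n trusted.afr.testvol-client-1 \
             -v 0x000000010000000000000000 /bricks/b0/file1

    # Ask the self-heal machinery for a full heal and check the result.
    gluster volume heal testvol full
    gluster volume heal testvol info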
The test fails intermittently when run in a loop because data
self-heal attempts non-blocking locks before healing, and the two heal
threads (one per brick) may try to acquire the locks at the same time
and both may fail (the sketch below illustrates the effect).
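
To make the race concrete, here is a small analogy in shell using
flock(1) on two lock files; it is not the AFR inodelk code, and the
opposite acquisition order below merely stands in for the parallel,
non-blocking lock requests the two heal threads send to both bricks:

    # Two "healers" each need both per-brick locks and take them
    # non-blockingly. If the attempts interleave so that each wins one
    # lock, both back off and neither heals the file.
    (   # healer A: brick0 lock first, then brick1
        flock -n 8 || { echo "healer A backs off"; exit 1; }
        sleep 1     # widen the race window
        flock -n 9 || { echo "healer A backs off"; exit 1; }
        echo "healer A heals the file"
    ) 8>/tmp/brick0.lock 9>/tmp/brick1.lock &

    (   # healer B: brick1 lock first, then brick0
        flock -n 9 || { echo "healer B backs off"; exit 1; }
        sleep 1
        flock -n 8 || { echo "healer B backs off"; exit 1; }
        echo "healer B heals the file"
    ) 8>/tmp/brick0.lock 9>/tmp/brick1.lock &

    wait   # typically both back off and the file stays unhealed; when
           # one healer happens to win both locks the heal goes through,
           # which is why the .t only fails intermittently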
In afr-v1, only one heal thread gets spawned if both bricks are on the
same node. We cannot do this in afr-v2 because, unlike v1, it has no
conservative merge in afr_opendir_cbk(). We are not sure that adding a
conservative merge to v2 is a good idea, because it involves (multiple)
readdirs on both bricks and computing a checksum on the entries to
detect mismatches, which can be a costly operation when done from
clients. Making the locks blocking could cause one heal thread to block
on a file held by the other thread instead of moving on to heal other
files. One approach is to do what ec does: use a virtual xattr, handled
in the getxattr FOP, to trigger data heals from clients (sketched
below). More thought needs to be given to this.
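
For comparison, this is roughly what the ec-style client trigger looks
like from a mount; the ec virtual xattr name is quoted from memory, and
the AFR xattr in the second command does not exist -- it is only meant
to show the shape such an interface could take:

    # ec today: a getxattr on a virtual xattr from the mount makes the
    # disperse xlator heal that file (name approximate, from memory).
    getfattr -n trusted.ec.heal /mnt/dispersevol/file1

    # Hypothetical AFR counterpart (not implemented): AFR would catch
    # this getxattr FOP on the client and launch a properly locked data
    # heal for the file, instead of relying on per-brick heal threads.
    getfattr -n trusted.afr.heal /mnt/testvol/file1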
Regards,
Ravi