[Gluster-devel] AFR Heal Bug

Gareth Bult gareth at encryptec.net
Sun Dec 30 19:09:56 UTC 2007


Ok, I'm going to call this a bug; tell me if I'm wrong .. :) 

(two servers, each defining a "homes" volume) 

Client; 

volume nodea-homes 
type protocol/client 
option transport-type tcp/client 
option remote-host nodea 
option remote-subvolume homes 
end-volume 

volume nodeb-homes 
type protocol/client 
option transport-type tcp/client 
option remote-host nodeb 
option remote-subvolume homes 
end-volume 

volume homes-afr 
type cluster/afr 
subvolumes nodea-homes nodeb-homes ### ISSUE IS HERE! ### 
option scheduler rr 
end-volume 
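
For reference, the server side on both nodes is nothing exotic; a minimal spec along these lines exports the "homes" volume (the backend directory and the wide-open auth line here are just placeholders, not necessarily my exact config):

volume homes
type storage/posix
option directory /export/homes
end-volume

volume server
type protocol/server
option transport-type tcp/server
option auth.ip.homes.allow *
subvolumes homes
end-volume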

Assume the system is completely up to date and working OK. 
Mount the homes filesystem on the "client". 
Kill the "nodea" server. 
The system carries on, effectively using nodeb. 
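
(For what it's worth, the mount and the "kill" are nothing clever; roughly the following, with the spec file path and mountpoint just being examples from my setup:)

# on the client: mount the AFR volume using the spec above
glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/homes

# on nodea: simulate the server dying
killall glusterfsd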

Wipe nodea's physical volume. 
Restart nodea server. 
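
(Again, roughly this, with the export path being an example; substitute wherever your "homes" backend actually lives:)

# on nodea: wipe the backing store behind the exported "homes" volume
rm -rf /export/homes/*

# bring the server back up with its usual spec file
glusterfsd -f /etc/glusterfs/glusterfs-server.vol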

All of a sudden, the client sees an empty "homes" filesystem, although the data is still in place on nodeb and nodea is blank. 
i.e. the client is seeing the blank "nodea" only (!) 

.. at this point you check nodeb to make sure your data really is there, then you can mop up the coffee you've just spat all over your screens .. 

If you crash nodeb instead, there appears to be no problem, and a self heal "find" will correct the blank volume. 
Alternatively, if you reverse the order of the subvolumes as listed above, you don't see the problem. 
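
(By a self heal "find" I mean the usual trick of forcing a read of every file from the client mount so AFR re-syncs each one; something like this, the mountpoint being an example:)

# run against the client mountpoint to read every file and trigger self heal
find /mnt/homes -type f -exec head -c1 {} \; > /dev/null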

The issue appears to be triggered by blanking the first listed subvolume. 

I'm thinking the order of the volumes should not matter: gluster should be able to tell that one volume is empty/new and the other contains real data, and act accordingly, rather than relying on the order in which the subvolumes are listed .. (???) 
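
As far as I understand it, AFR keeps its version bookkeeping in extended attributes on the backend files, so the information to tell a fresh/blank copy from the real one should already be there. You can peek at it directly on each server's backend, something like the below (the path is an example, and I'm assuming the attributes live under trusted.afr.*):

# on each server, as root, inspect AFR's extended attributes on a backend file
getfattr -d -m trusted.afr -e hex /export/homes/somefile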

I'm using fuse glfs7 and gluster 1.3.8 (tla). 


